Over the years of working with telecom data, my folder of code snippets has collected a lot of reusable examples. And I am not talking about “SELECT * FROM Table1”: I mean finding, handling, or removing duplicate values; selecting the top N values from each group within the same table; shuffling records within groups while keeping the groups sorted; finding the longest left match; expanding numbers by N digits; and so on.
Today I would like to share those snippets and explain how they work. I work with Microsoft SQL Server and use T-SQL in my queries, but I am sure it is possible to find alternatives for other DBMSs. If not, reach out to me and we will try to find one together 🙂
For this article I have prepared a table of fake calls with fake numbers and real countries, so if you try to repeat the same transaction it won’t work 😀 (ok, except for the UK, which really does have the +44 country code, but I am not sure the number I used exists)
Dealing with Duplicates
The fastest and easiest way to select only unique values from a table is the DISTINCT keyword:
SELECT DISTINCT *
FROM [test].[dbo].[12162018_FakeCalls]
By listing only the columns we want to retrieve, we can select unique values based on the numbers and countries alone:
SELECT DISTINCT NumberFrom, CountryFrom, NumberTo, CountryTo
FROM [test].[dbo].[12162018_FakeCalls]
But we lose the information from the other columns.
We can also find duplicates with GROUP BY, HAVING and COUNT(*):
SELECT NumberFrom, CountryFrom, NumberTo, CountryTo, COUNT(*) AS dups
FROM [test].[dbo].[12162018_FakeCalls]
GROUP BY NumberFrom, CountryFrom, NumberTo, CountryTo
HAVING COUNT(*) > 1
ORDER BY dups DESC
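The same duplicate-counting idea ports to other engines almost verbatim. Here is a minimal sketch using SQLite through Python’s sqlite3 module; the table name and sample rows are invented stand-ins for the fake-calls data:

```python
# The GROUP BY / HAVING duplicate check, ported to SQLite via Python's
# sqlite3 module. The table name and sample rows are invented stand-ins
# for the article's fake-calls data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE FakeCalls (
    NumberFrom TEXT, CountryFrom TEXT,
    NumberTo TEXT, CountryTo TEXT, Duration INTEGER)""")
con.executemany(
    "INSERT INTO FakeCalls VALUES (?, ?, ?, ?, ?)",
    [("+491701", "GERMANY", "+441632", "UK", 120),
     ("+491701", "GERMANY", "+441632", "UK", 95),   # same route: a duplicate
     ("+301234", "GREECE",  "+441632", "UK", 45)])

dups = con.execute("""
    SELECT NumberFrom, CountryFrom, NumberTo, CountryTo, COUNT(*) AS dups
    FROM FakeCalls
    GROUP BY NumberFrom, CountryFrom, NumberTo, CountryTo
    HAVING COUNT(*) > 1
    ORDER BY dups DESC""").fetchall()
print(dups)  # [('+491701', 'GERMANY', '+441632', 'UK', 2)]
```

Note that the two duplicate rows differ in Duration; because Duration is not in the GROUP BY list, they still count as one duplicate pair.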
I know, duplicates should not appear in a SQL database at all; you should use unique ids, constraints and indexes. But sometimes you are just running some tests, or doing some computations for the first time, experimenting, and duplicates appear. Or your records have distinct ids, yet you understand that one of them should be removed. What would you do in these cases? A long time ago I found this solution on Stack Overflow, saved it, and since then I have used it at least a few times per day (duplicates are quite common; job specifics). The WITH statement and ROW_NUMBER() are my saviors.
WITH CTE AS(
SELECT NumberFrom, CountryFrom, NumberTo, CountryTo, Duration, id,
RN = ROW_NUMBER() OVER (PARTITION BY NumberFrom, CountryFrom, NumberTo, CountryTo ORDER BY CountryFrom)
FROM #t
)
SELECT * FROM CTE WHERE RN > 1
I have copied the original data into a temporary table #t, because “Undo” doesn’t work after a DELETE, heh 🙂 We get the same result as with GROUP BY, but with all the columns. And now we can run a DELETE here and remove all the duplicate values based on our definition of duplication.
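The “work on a copy” trick is worth having in portable form too. In SQL Server this is SELECT … INTO #t; in SQLite (sketched here via Python’s sqlite3, with an invented table and row) the equivalent is CREATE TABLE … AS SELECT:

```python
# "Undo" doesn't work after DELETE, so experiment on a scratch copy.
# SQL Server: SELECT * INTO #t FROM source. SQLite equivalent below.
# Table name and row are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FakeCalls (NumberFrom TEXT, Duration INTEGER)")
con.execute("INSERT INTO FakeCalls VALUES ('+491701', 120)")
con.execute("CREATE TABLE t AS SELECT * FROM FakeCalls")  # scratch copy
copy = con.execute("SELECT * FROM t").fetchall()
print(copy)  # [('+491701', 120)]
```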
WITH CTE AS(
SELECT NumberFrom, CountryFrom, NumberTo, CountryTo, Duration, id,
RN = ROW_NUMBER() OVER (PARTITION BY NumberFrom, CountryFrom, NumberTo, CountryTo ORDER BY CountryFrom)
FROM #t
)
DELETE FROM CTE WHERE RN > 1;
SELECT *
FROM #t
PARTITION BY serves as the duplication definition in this case, and with ORDER BY we can, for example, remove the calls that were shorter/longer in duration, or those that started earlier/later, or whatever condition you may find in the other columns. Please note, the ORDER BY clause is a must here, so if you don’t have any specific condition for ordering, just put any field there.
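The same delete-duplicates-but-keep-one pattern can be sketched in portable form. SQLite (3.25+) also has ROW_NUMBER(), but it does not allow a DELETE against a CTE the way SQL Server does, so this sketch, with an invented table and data, deletes by id instead; ordering by Duration DESC keeps the longest call of each duplicate group:

```python
# Delete duplicates but keep one row per group, in portable form.
# SQLite (3.25+) supports ROW_NUMBER() but not DELETE against a CTE,
# so we collect the ids of rows with RN > 1 and delete those.
# Table and values are invented; ORDER BY Duration DESC keeps the
# longest call of each duplicate group.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE t (
    id INTEGER PRIMARY KEY,
    NumberFrom TEXT, CountryFrom TEXT, Duration INTEGER)""")
con.executemany(
    "INSERT INTO t (NumberFrom, CountryFrom, Duration) VALUES (?, ?, ?)",
    [("+491701", "GERMANY", 120),
     ("+491701", "GERMANY", 90),   # duplicate of the row above
     ("+301234", "GREECE", 45)])

con.execute("""
    DELETE FROM t WHERE id IN (
        SELECT id FROM (
            SELECT id, ROW_NUMBER() OVER (
                           PARTITION BY NumberFrom, CountryFrom
                           ORDER BY Duration DESC) AS RN
            FROM t)
        WHERE RN > 1)""")
rows = con.execute("SELECT NumberFrom, Duration FROM t ORDER BY id").fetchall()
print(rows)  # [('+491701', 120), ('+301234', 45)] - the 90s duplicate is gone
```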
Selecting Top N Records Within the Group
I believe you already know how to do that :D. Yes, with ROW_NUMBER() again. This function will partition our data and assign a number to each record within a group, so then we just have to specify which range of top N we want to select. Top 10? RN <= 10. From top 3 to top 7? Easy: RN BETWEEN 3 AND 7, or RN >= 3 AND RN <= 7.
Let’s rank the records within groups of calling countries (CountryFrom) whose names start with the same letter, sorting by duration within each group, and then select only the second record from each group. (I have inserted the same data into the temp table three times to make the differences easier to see.)
WITH CTE AS(
SELECT NumberFrom, CountryFrom, NumberTo, CountryTo, Duration, id,
RN = ROW_NUMBER() OVER (PARTITION BY LEFT(CountryFrom, 1) ORDER BY Duration ASC)
FROM #t
)
SELECT *
FROM CTE;
Look how Germany and Greece are put into one group, and Ireland and Italy into another. And to select only the second record within each group:
WITH CTE AS(
SELECT NumberFrom, CountryFrom, NumberTo, CountryTo, Duration, id,
RN = ROW_NUMBER() OVER (PARTITION BY LEFT(CountryFrom, 1) ORDER BY Duration ASC)
FROM #t
)
SELECT *
FROM CTE
WHERE RN = 2;
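The same top-N-per-group pattern in portable form, sketched with SQLite (3.25+) via Python’s sqlite3; the table and values are invented, and substr() plays the role of T-SQL’s LEFT():

```python
# Top-N-per-group in portable form (SQLite 3.25+ via Python's sqlite3).
# substr(CountryFrom, 1, 1) replaces T-SQL's LEFT(CountryFrom, 1).
# Table and values are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (CountryFrom TEXT, Duration INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("GERMANY", 30), ("GREECE", 10), ("GERMANY", 20),
                 ("IRELAND", 50), ("ITALY", 40)])

rows = con.execute("""
    WITH CTE AS (
        SELECT CountryFrom, Duration,
               ROW_NUMBER() OVER (
                   PARTITION BY substr(CountryFrom, 1, 1)
                   ORDER BY Duration ASC) AS RN
        FROM t)
    SELECT CountryFrom, Duration FROM CTE
    WHERE RN = 2
    ORDER BY CountryFrom""").fetchall()
print(rows)  # [('GERMANY', 20), ('IRELAND', 50)]
```

Within the “G” group the durations rank GREECE 10 → 1, GERMANY 20 → 2, GERMANY 30 → 3, so RN = 2 picks the 20-second GERMANY call; in the “I” group it picks the 50-second IRELAND call.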
Shuffling Records Within the Group
Actually, this technique can be used to shuffle records in any case, not only within a group, but since we already started with groups, ROW_NUMBER() and WITH statements, I decided to make it a little bit more complicated. I don’t know if at any point in your life you will need to shuffle records within a group, but if that one occasion appears, you will know how to do it. Again, this is part of my day-to-day job, and I know it’s a rare task :)) So, back to queries. We will use the same grouping condition, and for shuffling we’ll combine three functions: ABS(), which returns the absolute value of the number passed as an argument; CHECKSUM(), which returns the checksum value computed over a table row or over an expression list; and NEWID(), which creates a unique value of type uniqueidentifier. While reading this you might think: what the hell? But I’m telling you, it does the job. Look:
WITH CTE AS(
SELECT NumberFrom, CountryFrom, NumberTo, CountryTo, CallStarted, Duration, id,
RN = ROW_NUMBER() OVER (PARTITION BY LEFT(CountryFrom, 1) ORDER BY ABS(CHECKSUM(NEWID())) % 1000000 DESC)
FROM [test].[dbo].[12162018_FakeCalls]
)
SELECT * FROM CTE
It might not be that obvious with a small amount of data, but the most exciting thing is that every time you run this query it shuffles again. So after a second run:
You may have noticed that now POLAND is the first record and SPAIN is the second.
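Outside SQL Server, the ABS(CHECKSUM(NEWID())) trick can be replaced with the engine’s own random function. Here is a sketch with SQLite’s random() via Python’s sqlite3 (invented data); every run assigns a fresh random order within each first-letter group:

```python
# Shuffling within groups in portable form: SQLite's random() stands in
# for SQL Server's ABS(CHECKSUM(NEWID())). Window functions need
# SQLite 3.25+. Table and values are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (CountryFrom TEXT)")
con.executemany("INSERT INTO t VALUES (?)",
                [("GERMANY",), ("GREECE",), ("GERMANY",),
                 ("ITALY",), ("IRELAND",)])

rows = con.execute("""
    SELECT CountryFrom,
           ROW_NUMBER() OVER (
               PARTITION BY substr(CountryFrom, 1, 1)
               ORDER BY random()) AS RN
    FROM t""").fetchall()
print(rows)  # a fresh random order within each group on every run
```

Whatever order random() produces, each group still gets a clean 1..N numbering, which is what makes the shuffle composable with the RN filters from the earlier sections.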
That’s it for today. A few tips with ROW_NUMBER() that help me a lot in my work, and I hope they will help someone else too.
Even though it doesn’t include machine learning, artificial intelligence, image recognition, model tuning, deep neural networks or any other fancy terms that pop up in the data science field, working with data using SQL is still part of data science. And there are still a lot of data professionals who work with this language and with relational databases. Let’s not forget our roots, my fellows.
Thanks for reading, be cool, professional and have a fantastic day.
Photo by Kalen Emsley on Unsplash
Thank you, your article helped me a lot
You’re welcome 🙂
Hello,
I have a similar problem but I can’t manage to solve it. Please help me
Send me an email and we can look at it together 😉