ROW_NUMBER is one of the most valuable and versatile functions in SQL. It can be leveraged for different use cases, from ranking items, identifying data quality gaps, doing some minimization, handling preference queries, or helping with sessionization etc.
The ROW_NUMBER function isn’t, however, a traditional function. It is a window function. Window functions are an advanced kind of function, with specific properties. This article aims to go over how window functions, and more specifically, how the ROW_NUMBERfunction work, and to go over some of the use cases for the ROW_NUMBER function.
To understand how a window function work, it is essential first to understand, what type of arguments it can take. The following is the syntax for providing an argument using the window function.
As you can see, the row number doesn’t take a direct argument. We don’t have a ROW_NUMBER(a.columna) , for instance, but takes arguments in the OVER clause. The argument it takes is called a window. The window defines a subset of the dataset to use for the computation. Different arguments can be used to define this window, partitions, orders, rows between.
Other window functions may also include direct arguments like traditional functions, such as the SUM window function, e.g.: SUM(amount) OVER (window) , in which case we would be summing the amount over a subset of the data as defined by the window.
The PARTITION BY argument allows us to split the dataset. The typical way to uses it is to specify the list of columns on which we would like to start a new count on:
The above statement would, for instance, gives us, for each client, a row number from 1 to n (number of client in the city).
The partition by clause can, however, accept more complicated expressions. This is better shown using a SUM window function rather than a ROW_NUMBER function.
The result of the query is the following:
What the query does is handling the SUM with a partition set for t=1, and another for the rest of the query (NULL). The split between the dataset happens after the evaluation from the case statement query.
Since for t=1, we have one partition equal to t (which can have only one value), we have a first group (sub-dataset) to do computation on. For t != 1, there is only one option is to be part of the group with an output as a NULLvalue. Since this group is composed of 2 records with t=2 and one record with t=3, the sum for the group is equal to 7.
If we replaced the window function with the following:
We would generate three groups to split the data into t=1, t=2, and t>2. The respective sums would be 1,4 and 3.
The same type of operations can also be performed to compute the row numbers.
The order by argument will define, for the purpose of this specific function, how the dataset will be sorted. A simple ROW_NUMBER query such as the following will only be providing a sorted dataset by value with the associate row_number as if it was a full dataset:
The ORDER BY window argument can like the general query order by support ascending (ASC) or descending modifiers (DESC).
Other supported modifiers are related to the treatment of null values. We can select if null values should be considered first (NULLS FIRST)or last (NULLS LAST).
Some dialects, such as T-SQL or SQLite, allow for the use of aggregate functions within the window for ordering purposes.
The result of which is the following:
This, however, requires the use of a group by aggregation.
Besides the partition by clause, it is also possible to specify the subset of data to use based on what is called a frame specification. The frame specification will either take a subset of data based on the row placement within the partition or a numeric or temporal value.
The frame specification is typically placed after a ORDER BY clause, and is generally started with either a ROW or RANGE operator. Each takes an indication of how many units before and after the current row to use to calculate the output of the function. Some examples of this are ROWS 5 PRECEDING AND 1 FOLLOWING , RANGE 1 PRECEDING AND CURRENT ROW or RANGE INTERVAL 5 DAY PRECEDING AND 0 DAY FOLLOWING. It can also take unbounded arguments, for example:ROWS UNBOUNDED PRECEDING AND CURRENT ROW. An example query making use of this frame specification is provided below using a SUM window function for illustrative purpose:
When leveraging multiple window functions in the same query, it is possible to render its content through a window alias. Windows can be aliased defining them after the HAVING statement (if used) or if not used, a used statement occurring just before in the SQL evaluation order (FROM/WHERE/GROUP BY). An example of window aliasing is shown below:
One of the typical use cases of the ROW_NUMBER function is that of ranking records. Other functions exist to rank values in SQL, such as the RANK and DENSE_RANK functions. It is essential to understand their particularities and differences.
An example query shows how the different functions would behave:
The uniqueness property of ROW_NUMBER is one of its’ most significant advantages. Since we know that there can be at most one record for a given value of a ROW_NUMBER We should be able to join on it within worrying about cardinality safely.
The ROW_NUMBER function can be used for minimization or maximization on the dataset. Using, it is possible to get some ARG MAX.
Let say we have been asked to find the vehicle that has been able to travel the fastest between the route of Paris to Amsterdam. We are interested in knowing the model and brand of the car that traveled the fastest. We can use the ROW_NUMBER function to help us in this calculation. The following query would provide us with this type of calculation:
There can be cases where it is needed to have some mutually exclusive preference across the records. For instance, if you are provided a list of users’ contact details, and need to select them in the most cost-effective manner, preferring, for instance, to send them an email rather than giving them a phone call or preferring to phone them rather than to send them a snail mail.
It is possible to implement these types of queries without window functions. However, this can lead to relatively long, complex, and inefficient queries. Take a look at the following query:
Using the ROW_NUMBER window function, this query can be better expressed using a preference query:
This approach has the following advantages:
Short: The query is significantly more condensed than without a ROW_NUMBER window function, making it easier to read or modify as requirements evolve.
Simplicity: The query in itself is expressed in quite a simple way; no need to go back and forth to understand what is getting filtered or combined at different steps in the process.
Performance: In this query, instead of doing three pass-through the data + needing to join on these different tables, we merely need to sort through the data to obtain the records that we seek.
ROW_NUMBER provides one of the best tools to deduplicate values, for instance, when needing to deal with duplicate data being loaded onto a table. It allows us to select only one record from each duplicate set. To deduplicate, the critical thing to do is to incorporate all the fields that are meant to represent the “uniqueness” within the PARTITION BY argument:
In some cases, we can leverage the ROW_NUMBER function to identify data quality gaps. This is the case, for instance, when leveraging clickstream data making use of a “hit number” indicator. Vendor provided solutions, such as Google Analytics, to make use of the “hit count” generated client-side. These “hits” represent events that need to be sent to the server. Missing hits numbers therefore represent some events that should have been sent but did not end up being collected in the database.
Sometimes, it is possible to reconstruct these events artificially. The ROW_NUMBER function helps to identify where these data gaps occur. A test can be implemented leveraging the ROW_NUMBER and LAG window functions, to identify events within the data that first come out of sequence. The LAG window function takes the N preceding value (by default 1) in the window. This is exemplified in the following query:
After having identified the events that are “out of sync,” it is possible to do a second pass on the dataset to apply a transformation fix. This is typically done by looking at the previous row available (preceding RN) and the current row to generate the artificial events that should have happened or were likely to have occurred.
Another place where ROW_NUMBER can help is in performing sessionization. Different rules can be implemented to generate the sessionization. One of the most straightforward rules is that the session needs to happen on the same calendar day.
An example of how we can use the ROW_NUMBER function to create this event sessionization is provided in the query below:
ROW_NUMBER is one of the most useful SQL functions to master for data engineers. It has a wide range of applications and often provides a simple path to handle some of the typical data engineering problems such as deduplication, sessionization, or dealing with preference queries.