Pretty often SQL is used to calculate different metrics or to aggregate values and measurements. Alongside with aggregating functions there are windows functions, which are used widely.
As an example we're going to use a small table with sudents' subjects & marks. This is how the table looks like in the DB
create table student_grades
(
name varchar,
subject varchar,
grade int
);
insert into student_grades
values
('Peter', 'english', 4),
('Peter', 'physics', 5),
('Peter', 'history', 4),
('Mary', 'math', 4),
('Mary', 'english', 4),
('Mary', 'physics', 5),
('Mary', 'history', 3);
select * from student_grades;
name | subject | grade |
Peter | english | 4 |
Peter | physics | 5 |
Peter | history | 4 |
Mary | math | 4 |
Mary | english | 4 |
Mary | physics | 5 |
Mary | history | 3 |
Window function in SQL - is a function, which works with a set of rows (aka window, partition) and makes calculations for the set in a separate column.
Partitions (windows) - is a set of rows specified for a window function with a column(s) of the table. Paritions for every window function in a query can be divided into different table columns.
name | subject | grade |
Peter | english | 4 |
Peter | physics | 5 |
Peter | history | 4 |
Mary | math | 4 |
Mary | english | 4 |
Mary | physics | 5 |
Mary | history | 3 |
Partitions by the name field
What is the main difference between window functions and aggregate grouping functions?
When using the aggregate functions the statement GROUP BY reduces the number of records in result by grouping them.
name | subject | grade |
Peter | english | 4 |
Peter | physics | 5 |
Peter | history | 4 |
Mary | math | 4 |
Mary | english | 4 |
Mary | physics | 5 |
Mary | history | 3 |
name | average grade |
Peter | 4.3 |
Mary | 4 |
Using the window functions doesn't reduce the amount of records in comparison to the initial table.
name | subject | grade | average grade |
Peter | english | 4 | 4.3 |
Peter | physics | 5 | 4.3 |
Peter | history | 4 | 4.3 |
Mary | math | 4 | 4 |
Mary | english | 4 | 4 |
Mary | physics | 5 | 4 |
Mary | history | 3 | 4 |
The order of calculating window functions in an SQL query.
SELECT | list of columns, window functions |
FROM | table, joint tables, subquery |
WHERE | filtering clause |
GROUP | list of columns |
HAVING | aggregation filtering clause |
ORDER BY | list of columns / window functions |
- First of all, the command for selecting tables, joining them, and possible subqueries under the FROM command is executed.
- Then goes WHERE filtration, grouping with GROUP BY and possible HAVING filter
- After that column selection command is applied alongside with the calculating of window functions
- At the end ORDER BY sorting is executed - here you can specify a window function column as well
It's important to remember that partitions of window functions are created after the table is grouped with GROUP BY command (in case if the last one participate in a query).
Syntax of window functions
Syntax of window functions consists of identical commands (regardless of the function type).
Function name | > | FUNCTION_NAME(column_name) |
Filter expression (optional) | > | [FILTER (WHERE filter_clause)] |
Keyword definition | > | OVER |
( | ||
Declaring a partition by columns | > | PARTITION BY (column_names), |
Sorting Window Function Ńalculation | > | ORDER BY (column_names) |
Frame for the partition | > | [frame_clause] |
) |
Window functions can be declared either inside the SELECT command or with a special keyword WINDOW, where the window gets an alias, which we can point to from the SELECT section.
SELECT name, subject, grade,
row_number() over (partition by name order by grade desc),
rank() over (partition by name order by grade desc),
dense_rank() over (partition by name order by grade desc)
FROM student_grades;
-- is equal to
SELECT name, subject, grade,
row_number() over name_grade,
rank() over name_grade,
dense_rank() over name_grade
FROM student_grades
WINDOW name_grade as (partition by name order by grade desc);
Categories of window functions
Aggregate | Ranking | Value |
---|---|---|
AVG() | DENSE_RANK() | FIRST_VALUE() |
COUNT() | NTILE() | LAST_VALUE() |
MAX() | RANK() | LAG() |
MIN() | ROW_NUMBER() | LEAD() |
SUM() | CUME_DIST() | NTH_VALUE() |
Aggregate functions
SELETE name, subject, grade,
sum(grade) over (partition by name) as sum_grade,
avg(grade) over (partition by name) as avg_grade,
count(grade) over (partition by name) as count_grade,
min(grade) over (partition by name) as min_grade,
max(grade) over (partition by name) as max_grade
FROM student_grades;
name | subject | grade | sum_grade | avg_grade | count_grade | min_grade | max_grade |
Peter | english | 4 | 13 | 4.3 | 3 | 4 | 5 |
Peter | physics | 5 | 13 | 4.3 | 3 | 4 | 5 |
Peter | history | 4 | 13 | 4.3 | 3 | 4 | 5 |
Mary | math | 4 | 16 | 4 | 4 | 3 | 5 |
Mary | english | 4 | 16 | 4 | 4 | 3 | 5 |
Mary | physics | 5 | 16 | 4 | 4 | 3 | 5 |
Mary | history | 3 | 16 | 4 | 4 | 3 | 5 |
Ranking functions
In ranking functions under the OVER keyword the condition ORDER BY is required.
- ROW_NUMBER() - the function calculates a number of rows inside the partition regardless of duplicate values.
- RANK() - the function calculates a rank of every row inside the partition. If there are duplicates, the function returns the same rank for every duplicate skipping the next rank number.
- DENSE_RANK() - the same as RANK, but doesn't skip the next rank number.
SELECT name, subject, grade,
row_number() over (partition by name order by grade desc),
rank() over (partition by name order by grade desc),
dense_rank() over (partition by name order by grade desc)
FROM student_grades;
name | subject | grade | row_number | rank | dense_rank |
Peter | physics | 5 | 1 | 1 | 1 |
Peter | english | 4 | 2 | 2 | 2 |
Peter | history | 4 | 3 | 2 | 2 |
Mary | physics | 5 | 1 | 1 | 1 |
Mary | math | 4 | 2 | 2 | 2 |
Mary | english | 4 | 3 | 2 | 2 |
Mary | history | 3 | 4 | 4 | 3 |
NULL values are combined into the same partition
Value functions
Those are the functions, which go through the partition and returns border values of the partition or previous/next values.
- LAG() - a function, which returns the previous value of the column according to the sort order.
- LEAD() - a function, which returns the next value of the column according to the sort order.
A small example of how to get current grade and grades from next/previous semesters
create table grades_quartal (
name varchar,
semester varchar,
subject varchar,
grade int
);
insert into grades_quartal (
values
('Peter', '1 semester', 'physics', 4),
('Peter', '2 semester', 'physics', 3),
('Peter', '3 semester', 'physics', 4),
('Peter', '4 semester', 'physics', 5)
);
select * from grades_quartal;
name | semester | subject | grade |
---|---|---|---|
Peter | 1 semester | physics | 4 |
Peter | 2 semester | physics | 3 |
Peter | 3 semester | physics | 4 |
Peter | 4 semester | physics | 5 |
SELECT name, quartal, subject, grade,
lag(grade) over (order by semester) as previous_grade,
lead(grade) over (order by semester) as next_grade
FROM grades_quartal;
name | semester | subject | grade | previous_grade | next_grade |
---|---|---|---|---|---|
Peter | 1 semester | physics | 4 | NULL | 3 |
Peter | 2 semester | physics | 3 | 4 | 4 |
Peter | 3 semester | physics | 4 | 3 | 5 |
Peter | 4 semester | physics | 5 | 4 | NULL |
FIRST_VALUE()/LAST_VALUE() - functions, which returns first or last values of the column of the specified partition. ORDER BY condition is required to specify next to OVER