Python Sucks

SQL window functions. Basic

Feb. 22, 2024

Pretty often SQL is used to calculate different metrics or to aggregate values and measurements. Alongside with aggregating functions there are windows functions, which are used widely.

As an example we're going to use a small table with sudents' subjects & marks. This is how the table looks like in the DB

create table student_grades 
(
    name varchar,
    subject varchar,
    grade int
);

insert into student_grades
values
    ('Peter', 'english', 4),
    ('Peter', 'physics', 5),
    ('Peter', 'history', 4),
    ('Mary', 'math', 4),
    ('Mary', 'english', 4),
    ('Mary', 'physics', 5),
    ('Mary', 'history', 3);

select * from student_grades;

name	subject	grade
Peter	english	4
Peter	physics	5
Peter	history	4
Mary	math	4
Mary	english	4
Mary	physics	5
Mary	history	3

Window function in SQL - is a function, which works with a set of rows (aka window, partition) and makes calculations for the set in a separate column.

Partitions (windows) - is a set of rows specified for a window function with a column(s) of the table. Paritions for every window function in a query can be divided into different table columns.

name	subject	grade
Peter	english	4
Peter	physics	5
Peter	history	4
Mary	math	4
Mary	english	4
Mary	physics	5
Mary	history	3

Partitions by the name field

What is the main difference between window functions and aggregate grouping functions?

When using the aggregate functions the statement GROUP BY reduces the number of records in result by grouping them.

name	subject	grade
Peter	english	4
Peter	physics	5
Peter	history	4
Mary	math	4
Mary	english	4
Mary	physics	5
Mary	history	3

name	average grade
Peter	4.3
Mary	4

Using the window functions doesn't reduce the amount of records in comparison to the initial table.

name	subject	grade	average grade
Peter	english	4	4.3
Peter	physics	5	4.3
Peter	history	4	4.3
Mary	math	4	4
Mary	english	4	4
Mary	physics	5	4
Mary	history	3	4

The order of calculating window functions in an SQL query.

SELECT	list of columns, window functions
FROM	table, joint tables, subquery
WHERE	filtering clause
GROUP	list of columns
HAVING	aggregation filtering clause
ORDER BY	list of columns / window functions

First of all, the command for selecting tables, joining them, and possible subqueries under the FROM command is executed.
Then goes WHERE filtration, grouping with GROUP BY and possible HAVING filter
After that column selection command is applied alongside with the calculating of window functions
At the end ORDER BY sorting is executed - here you can specify a window function column as well

It's important to remember that partitions of window functions are created after the table is grouped with GROUP BY command (in case if the last one participate in a query).

Syntax of window functions

Syntax of window functions consists of identical commands (regardless of the function type).

Function name	>	FUNCTION_NAME(column_name)
Filter expression (optional)	>	[FILTER (WHERE filter_clause)]
Keyword definition	>	OVER
		(
Declaring a partition by columns	>	PARTITION BY (column_names),
Sorting Window Function сalculation	>	ORDER BY (column_names)
Frame for the partition	>	[frame_clause]
		)

Window functions can be declared either inside the SELECT command or with a special keyword WINDOW, where the window gets an alias, which we can point to from the SELECT section.

SELECT name, subject, grade,
    row_number() over (partition by name order by grade desc),
    rank() over (partition by name order by grade desc),
    dense_rank() over (partition by name order by grade desc)
FROM student_grades;

-- is equal to

SELECT name, subject, grade,
    row_number() over name_grade,
    rank() over name_grade,
    dense_rank() over name_grade
FROM student_grades
WINDOW name_grade as (partition by name order by grade desc);

Categories of window functions

Aggregate	Ranking	Value
AVG()	DENSE_RANK()	FIRST_VALUE()
COUNT()	NTILE()	LAST_VALUE()
MAX()	RANK()	LAG()
MIN()	ROW_NUMBER()	LEAD()
SUM()	CUME_DIST()	NTH_VALUE()

Aggregate functions

SELETE name, subject, grade,
    sum(grade) over (partition by name) as sum_grade,
    avg(grade) over (partition by name) as avg_grade,
    count(grade) over (partition by name) as count_grade,
    min(grade) over (partition by name) as min_grade,
    max(grade) over (partition by name) as max_grade
FROM student_grades;

name	subject	grade	sum_grade	avg_grade	count_grade	min_grade	max_grade
Peter	english	4	13	4.3	3	4	5
Peter	physics	5	13	4.3	3	4	5
Peter	history	4	13	4.3	3	4	5
Mary	math	4	16	4	4	3	5
Mary	english	4	16	4	4	3	5
Mary	physics	5	16	4	4	3	5
Mary	history	3	16	4	4	3	5

Ranking functions

In ranking functions under the OVER keyword the condition ORDER BY is required.

ROW_NUMBER() - the function calculates a number of rows inside the partition regardless of duplicate values.
RANK() - the function calculates a rank of every row inside the partition. If there are duplicates, the function returns the same rank for every duplicate skipping the next rank number.
DENSE_RANK() - the same as RANK, but doesn't skip the next rank number.

SELECT name, subject, grade,
    row_number() over (partition by name order by grade desc),
    rank() over (partition by name order by grade desc),
    dense_rank() over (partition by name order by grade desc)
FROM student_grades;

name	subject	grade	row_number	rank	dense_rank
Peter	physics	5	1	1	1
Peter	english	4	2	2	2
Peter	history	4	3	2	2
Mary	physics	5	1	1	1
Mary	math	4	2	2	2
Mary	english	4	3	2	2
Mary	history	3	4	4	3

NULL values are combined into the same partition

Value functions

Those are the functions, which go through the partition and returns border values of the partition or previous/next values.

LAG() - a function, which returns the previous value of the column according to the sort order.
LEAD() - a function, which returns the next value of the column according to the sort order.

A small example of how to get current grade and grades from next/previous semesters

create table grades_quartal (
    name varchar,
    semester varchar,
    subject varchar,
    grade int
);

insert into grades_quartal (
values 
    ('Peter', '1 semester', 'physics', 4),
    ('Peter', '2 semester', 'physics', 3),
    ('Peter', '3 semester', 'physics', 4),
    ('Peter', '4 semester', 'physics', 5)
);

select * from grades_quartal;

name	semester	subject	grade
Peter	1 semester	physics	4
Peter	2 semester	physics	3
Peter	3 semester	physics	4
Peter	4 semester	physics	5

SELECT name, quartal, subject, grade, 
    lag(grade) over (order by semester) as previous_grade,
    lead(grade) over (order by semester) as next_grade
FROM grades_quartal;

name	semester	subject	grade	previous_grade	next_grade
Peter	1 semester	physics	4	NULL	3
Peter	2 semester	physics	3	4	4
Peter	3 semester	physics	4	3	5
Peter	4 semester	physics	5	4	NULL

FIRST_VALUE()/LAST_VALUE() - functions, which returns first or last values of the column of the specified partition. ORDER BY condition is required to specify next to OVER