If the frame for the window is defined as a the range between 'unbounded preceding and current row' the value for the sum of x will the sum for the entire column of data. However, if the frame is defined as the rows between 'unbounded preceding and current row' it will sum up the values of the current row and the values of the rows that came before; 1, 1+2, 1+2+3. etc.
mysql> SELECT x, sum(x) over w as 'rows',
-> sum(x) over y as 'range'
-> from x
-> window w as
(rows between unbounded preceding and current row),
-> y as
(range between unbounded preceding and current row);
+------+------+-------+
| x | rows | range |
+------+------+-------+
| 1 | 1 | 55 |
| 2 | 3 | 55 |
| 3 | 6 | 55 |
| 4 | 10 | 55 |
| 5 | 15 | 55 |
| 6 | 21 | 55 |
| 7 | 28 | 55 |
| 8 | 36 | 55 |
| 9 | 45 | 55 |
| 10 | 55 | 55 |
+------+------+-------+
10 rows in set (0.00 sec)
The terminology of frames and windows gets a little confusing and sound like a home remodeling project. For now consider 'unbounded' to be everything in the column, so unbounded preceding would be everything in the column before this row.
mysql> SELECT x,
sum(x) over w as 'rows',
sum(x) over y as 'range'
from x
window w as
(rows between current row and unbounded following),
y as (range between unbounded preceding and current row);
+------+------+-------+
| x | rows | range |
+------+------+-------+
| 1 | 55 | 55 |
| 2 | 54 | 55 |
| 3 | 52 | 55 |
| 4 | 49 | 55 |
| 5 | 45 | 55 |
| 6 | 40 | 55 |
| 7 | 34 | 55 |
| 8 | 27 | 55 |
| 9 | 19 | 55 |
| 10 | 10 | 55 |
+------+------+-------+
10 rows in set (0.00 sec)
More frames and windows next time!
More State Stuff
Last time was the start of a series on windowing functions that were introduced with MySQL 8. This entry will delve a little bit deeper into this realm. The data used for these examples starts off with the World database.
Population By State
Previously the data showed a total population for each state but listed an entry for each city in that state, er, district. It is quite easy to get a per state total by adding a simple GROUP BY to the query.
mysql> SELECT district, Sum(Population) OVER w
FROM city
WHERE CountryCode = 'USA'
GROUP BY District
WINDOW w AS (PARTITION BY District)
LIMIT 10;
+----------------------+------------------------+
| district | Sum(Population) OVER w |
+----------------------+------------------------+
| Alabama | 242820 |
| Alaska | 260283 |
| Arizona | 1321045 |
| Arkansas | 183133 |
| California | 3694820 |
| Colorado | 554636 |
| Connecticut | 139529 |
| District of Columbia | 572059 |
| Florida | 735167 |
| Georgia | 416474 |
+----------------------+------------------------+
10 rows in set (0.00 sec)
Going Back to Ohio
But what if we wanted a little more statistical information on a state's population? Well, there several functions just waiting for you.
mysql> select name, Population as 'Pop',
SUM(Population) OVER w AS 'sum',
RANK() over w as 'rank',
percent_rank() OVER w as '%rank'
from city where District='Ohio'
WINDOW w AS (ORDER BY Population);
+------------+--------+---------+------+-------+
| name | Pop | sum | rank | %rank |
+------------+--------+---------+------+-------+
| Dayton | 166179 | 166179 | 1 | 0 |
| Akron | 217074 | 383253 | 2 | 0.2 |
| Toledo | 313619 | 696872 | 3 | 0.4 |
| Cincinnati | 331285 | 1028157 | 4 | 0.6 |
| Cleveland | 478403 | 1506560 | 5 | 0.8 |
| Columbus | 711470 | 2218030 | 6 | 1 |
+------------+--------+---------+------+-------+
6 rows in set (0.01 sec)
Ohio is used here as it has just enough data to illustrate some concepts. Looking at the cities in that state starting with the largest population is done by creating a window that is ordered by Population, biggest first. Columbus has the largest population and its rank is '1'.
The PERCENTAGE_RANK is the percentage of scores in the window that are equal or lower to the value in the row, excluding the highest value, for the rank on a range from zero to one for that row. For a fun exercise , rerun the above query with the desc removed from the window definition (ORDER BY Population).
Differences
The functions first_value and last_value provide a way to see how much small the population is in the other cities than Columbus. To do this the current row, or last_value, is subtracted from the first value - Columbus's 711,470 - to provide the answer.
mysql> select name,
Population as 'Pop',
SUM(Population) OVER w AS 'sum',
(first_value(Population) over w - last_value(Population) over w) as 'diff'
from city
where District='Ohio'
WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+--------+
| name | Pop | sum | diff |
+------------+--------+---------+--------+
| Columbus | 711470 | 711470 | 0 |
| Cleveland | 478403 | 1189873 | 233067 |
| Cincinnati | 331285 | 1521158 | 380185 |
| Toledo | 313619 | 1834777 | 397851 |
| Akron | 217074 | 2051851 | 494396 |
| Dayton | 166179 | 2218030 | 545291 |
+------------+--------+---------+--------+
6 rows in set (0.00 sec)
mysql> select name, Population as 'Pop',
SUM(Population) OVER w AS 'sum',
RANK() over w as 'rank',
percent_rank() OVER w as '%rank'
from city where District='Ohio'
WINDOW w AS (ORDER BY Population);
+------------+--------+---------+------+-------+
| name | Pop | sum | rank | %rank |
+------------+--------+---------+------+-------+
| Dayton | 166179 | 166179 | 1 | 0 |
| Akron | 217074 | 383253 | 2 | 0.2 |
| Toledo | 313619 | 696872 | 3 | 0.4 |
| Cincinnati | 331285 | 1028157 | 4 | 0.6 |
| Cleveland | 478403 | 1506560 | 5 | 0.8 |
| Columbus | 711470 | 2218030 | 6 | 1 |
+------------+--------+---------+------+-------+
6 rows in set (0.01 sec)
The PERCENTAGE_RANK is the percentage of scores in the window that are equal or lower to the value in the row, excluding the highest value, for the rank on a range from zero to one for that row. For a fun exercise , rerun the above query with the desc removed from the window definition (ORDER BY Population).
Differences
The functions first_value and last_value provide a way to see how much small the population is in the other cities than Columbus. To do this the current row, or last_value, is subtracted from the first value - Columbus's 711,470 - to provide the answer.
mysql> select name,
Population as 'Pop',
SUM(Population) OVER w AS 'sum',
(first_value(Population) over w - last_value(Population) over w) as 'diff'
from city
where District='Ohio'
WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+--------+
| name | Pop | sum | diff |
+------------+--------+---------+--------+
| Columbus | 711470 | 711470 | 0 |
| Cleveland | 478403 | 1189873 | 233067 |
| Cincinnati | 331285 | 1521158 | 380185 |
| Toledo | 313619 | 1834777 | 397851 |
| Akron | 217074 | 2051851 | 494396 |
| Dayton | 166179 | 2218030 | 545291 |
+------------+--------+---------+--------+
6 rows in set (0.00 sec)
But how much smaller are cities three to six from the second city, Cleveland. The NTH_VALUE(expression, n) allows us to subtract the smaller values from the second entry.
mysql> select name, Population as 'Pop',
SUM(Population) OVER w AS 'sum',
(first_value(Population) over w - last_value(Population) over w)
as 'diff Columbus',
(nth_value(Population,2) over w - last_value(Population) over w)
as 'diff Cleveland'
from city where District='Ohio'
WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+---------------+----------------+
| name | Pop | sum | diff Columbus | diff Cleveland |
+------------+--------+---------+---------------+----------------+
| Columbus | 711470 | 711470 | 0 | NULL |
| Cleveland | 478403 | 1189873 | 233067 | 0 |
| Cincinnati | 331285 | 1521158 | 380185 | 147118 |
| Toledo | 313619 | 1834777 | 397851 | 164784 |
| Akron | 217074 | 2051851 | 494396 | 261329 |
| Dayton | 166179 | 2218030 | 545291 | 312224 |
+------------+--------+---------+---------------+----------------+
6 rows in set (0.01 sec)
But how much bigger is each city than the next biggest?? For that we need to use LAG(), LAG() takes the value of column from the row before (So Cleveland's value for LAG is Columbus's Population) and the default lag is 1. The corresponding LEAD() takes the value from the row after. So now it is easy to see that Toledo is only 17,666 behind Cincinnati.
mysql> select name,
Population as 'Pop',
lag(Population,1) over w as 'lag',
lead(Population,1) over w as 'lead',
Population - lag(Population) over w as 'diff'
from city
WHERE District = 'Ohio'
window w as (ORDER BY Population desc);
+------------+--------+--------+--------+---------+
| name | Pop | lag | lead | diff |
+------------+--------+--------+--------+---------+
| Columbus | 711470 | NULL | 478403 | NULL |
| Cleveland | 478403 | 711470 | 331285 | -233067 |
| Cincinnati | 331285 | 478403 | 313619 | -147118 |
| Toledo | 313619 | 331285 | 217074 | -17666 |
| Akron | 217074 | 313619 | 166179 | -96545 |
| Dayton | 166179 | 217074 | NULL | -50895 |
+------------+--------+--------+--------+---------+
6 rows in set (0.00 sec)
Dividing Up Into Buckets
Another useful function is the NTILE which divvies up the data into a certain number of groups or buckets. Be sure to use ORDER BY to ensure the rows you have selected are ordered correctly. Below the cities are split up into a group with four buckets and a group with three buckets.
mysql> SELECT name,
Population as 'Pop',
NTILE(4) over w as 'ntile4',
NTILE(3) over w as 'ntile3'
from city
where District = 'Ohio'
Window w as (ORDER BY Population desc);
+------------+--------+--------+--------+
| name | Pop | ntile4 | ntile3 |
+------------+--------+--------+--------+
| Columbus | 711470 | 1 | 1 |
| Cleveland | 478403 | 1 | 1 |
| Cincinnati | 331285 | 2 | 2 |
| Toledo | 313619 | 2 | 2 |
| Akron | 217074 | 3 | 3 |
| Dayton | 166179 | 4 | 3 |
+------------+--------+--------+--------+
6 rows in set (0.00 sec)