Tuesday, July 31, 2018

A Kind Introduction to MySQL Windowing Functions II

Before I take up from the last blog, I need to introduce RANGE and ROWs.  Windows over data can be framed and this is where things can get wild and woolly.   Table x has a column named x (me being overly creative again) that has the values one through 10.  If we sum the values of x we can get different values depending on how the frame is constructed.

If the frame for the window is defined as a the range between 'unbounded preceding and current row' the value for the sum of x will the sum for the entire column of data.  However, if the frame is defined as the rows between 'unbounded preceding and current row' it will sum up the values of the current row and the values of the rows that came before; 1, 1+2, 1+2+3. etc.


mysql> SELECT x, sum(x) over w as 'rows',
    -> sum(x) over y as 'range'
    -> from x
    -> window w as 
           (rows between unbounded preceding and current row),
    ->        y as 
           (range between unbounded preceding and current row);
+------+------+-------+
| x    | rows | range |
+------+------+-------+
|    1 |    1 |    55 |
|    2 |    3 |    55 |
|    3 |    6 |    55 |
|    4 |   10 |    55 |
|    5 |   15 |    55 |
|    6 |   21 |    55 |
|    7 |   28 |    55 |
|    8 |   36 |    55 |
|    9 |   45 |    55 |
|   10 |   55 |    55 |
+------+------+-------+
10 rows in set (0.00 sec)

The terminology of frames and windows gets a little confusing and sound like a home remodeling project.  For now consider 'unbounded' to be everything in the column, so unbounded preceding would be everything in the column before this row. 


mysql> SELECT x, 
       sum(x) over w as 'rows', 
       sum(x) over y as 'range' 
       from x 
       window w as 
         (rows between current row and unbounded following), 
         y as (range between unbounded preceding and current row);
+------+------+-------+
| x    | rows | range |
+------+------+-------+
|    1 |   55 |    55 |
|    2 |   54 |    55 |
|    3 |   52 |    55 |
|    4 |   49 |    55 |
|    5 |   45 |    55 |
|    6 |   40 |    55 |
|    7 |   34 |    55 |
|    8 |   27 |    55 |
|    9 |   19 |    55 |
|   10 |   10 |    55 |
+------+------+-------+
10 rows in set (0.00 sec)

So unbounded following would take all the values in the column after the current.  So the rows value where x is equal to 2 of 54 is the grand total sum of 55 less the value of the first row or 54 (and x =3 is 55 less 1 & 2).

More frames and windows next time!



More State Stuff


Last  time was the start of a series on windowing functions that were introduced with MySQL 8. This entry will delve a little bit deeper into this realm.  The data used for these examples starts off with the World database.

Population By State


Previously the data showed a total population for each state but listed an entry for each city in that state, er, district.  It is quite easy to get a per state total by adding a simple GROUP BY to the query.


mysql> SELECT district, Sum(Population) OVER w 
FROM city 
WHERE CountryCode = 'USA' 
GROUP BY District  
WINDOW w AS (PARTITION BY District) 
LIMIT 10;
+----------------------+------------------------+
| district             | Sum(Population) OVER w |
+----------------------+------------------------+
| Alabama              |                 242820 |
| Alaska               |                 260283 |
| Arizona              |                1321045 |
| Arkansas             |                 183133 |
| California           |                3694820 |
| Colorado             |                 554636 |
| Connecticut          |                 139529 |
| District of Columbia |                 572059 |
| Florida              |                 735167 |
| Georgia              |                 416474 |
+----------------------+------------------------+
10 rows in set (0.00 sec)


Going Back to Ohio


But what if we wanted a little more statistical information on a state's population?  Well, there several functions just waiting for you.

mysql> select name, Population as 'Pop', 
       SUM(Population) OVER w AS 'sum', 
       RANK() over w as 'rank', 
       percent_rank() OVER w as '%rank' 
       from city where District='Ohio' 
       WINDOW w AS (ORDER BY Population);
+------------+--------+---------+------+-------+
| name       | Pop    | sum     | rank | %rank |
+------------+--------+---------+------+-------+
| Dayton     | 166179 |  166179 |    1 |     0 |
| Akron      | 217074 |  383253 |    2 |   0.2 |
| Toledo     | 313619 |  696872 |    3 |   0.4 |
| Cincinnati | 331285 | 1028157 |    4 |   0.6 |
| Cleveland  | 478403 | 1506560 |    5 |   0.8 |
| Columbus   | 711470 | 2218030 |    6 |     1 |
+------------+--------+---------+------+-------+
6 rows in set (0.01 sec)


Ohio is used here as it has just enough data to illustrate some concepts.  Looking at the cities in that state starting with the largest population is done by creating a window that is ordered by Population, biggest first.  Columbus has the largest population and its rank is '1'.

The PERCENTAGE_RANK  is the percentage of scores in the window that are equal or lower to the value in the row, excluding the highest value, for the rank on a range from zero to one for that row.  For a fun exercise , rerun the above query with the desc removed from the window definition (ORDER BY Population).

Differences

The functions first_value and last_value provide a way to see how much small the population is in the other cities than Columbus. To do this the current row, or last_value, is subtracted from the  first value  - Columbus's 711,470 - to provide the answer.

 mysql> select name, 
         Population as 'Pop', 
         SUM(Population) OVER w AS 'sum', 
        (first_value(Population) over w - last_value(Population) over w) as 'diff'  
        from city 
        where District='Ohio' 
        WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+--------+
| name       | Pop    | sum     | diff   |
+------------+--------+---------+--------+
| Columbus   | 711470 |  711470 |      0 |
| Cleveland  | 478403 | 1189873 | 233067 |
| Cincinnati | 331285 | 1521158 | 380185 |
| Toledo     | 313619 | 1834777 | 397851 |
| Akron      | 217074 | 2051851 | 494396 |
| Dayton     | 166179 | 2218030 | 545291 |
+------------+--------+---------+--------+
6 rows in set (0.00 sec)

But how much smaller are cities three to six from the second city, Cleveland.  The NTH_VALUE(expression, n) allows us to subtract the smaller values from the second entry.

mysql> select name, Population as 'Pop', 
       SUM(Population) OVER w AS 'sum', 
   (first_value(Population) over w - last_value(Population) over w) 
            as 'diff Columbus', 
   (nth_value(Population,2) over w - last_value(Population) over w) 
            as 'diff Cleveland'  
      from city where District='Ohio' 
       WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+---------------+----------------+
| name       | Pop    | sum     | diff Columbus | diff Cleveland |
+------------+--------+---------+---------------+----------------+
| Columbus   | 711470 |  711470 |             0 |           NULL |
| Cleveland  | 478403 | 1189873 |        233067 |              0 |
| Cincinnati | 331285 | 1521158 |        380185 |         147118 |
| Toledo     | 313619 | 1834777 |        397851 |         164784 |
| Akron      | 217074 | 2051851 |        494396 |         261329 |
| Dayton     | 166179 | 2218030 |        545291 |         312224 |
+------------+--------+---------+---------------+----------------+
6 rows in set (0.01 sec)

But how much bigger is each city than the next biggest?? For that we need to use LAG(),  LAG() takes the value of column from the row before (So Cleveland's value for LAG is Columbus's Population) and the default lag is 1.  The corresponding LEAD() takes the value from the row after.  So now it is easy to see that Toledo is only 17,666 behind Cincinnati.

mysql> select name, 
       Population as 'Pop', 
       lag(Population,1) over w as 'lag', 
       lead(Population,1) over w as 'lead', 
       Population - lag(Population) over w as 'diff'  
       from city 
       WHERE District = 'Ohio' 
       window w as (ORDER BY Population desc);
+------------+--------+--------+--------+---------+
| name       | Pop    | lag    | lead   | diff    |
+------------+--------+--------+--------+---------+
| Columbus   | 711470 |   NULL | 478403 |    NULL |
| Cleveland  | 478403 | 711470 | 331285 | -233067 |
| Cincinnati | 331285 | 478403 | 313619 | -147118 |
| Toledo     | 313619 | 331285 | 217074 |  -17666 |
| Akron      | 217074 | 313619 | 166179 |  -96545 |
| Dayton     | 166179 | 217074 |   NULL |  -50895 |
+------------+--------+--------+--------+---------+
6 rows in set (0.00 sec)



Dividing Up Into Buckets


Another useful function is the NTILE which divvies up the data into a certain number of groups or buckets.  Be sure to use ORDER BY to ensure the rows you have selected are ordered correctly.  Below the cities are split up into a group with four buckets and a group with three buckets.  

mysql> SELECT name, 
              Population as 'Pop', 
              NTILE(4) over w as 'ntile4', 
              NTILE(3) over w as 'ntile3' 
              from city 
              where District = 'Ohio' 
              Window w as (ORDER BY Population desc);
+------------+--------+--------+--------+
| name       | Pop    | ntile4 | ntile3 |
+------------+--------+--------+--------+
| Columbus   | 711470 |      1 |      1 |
| Cleveland  | 478403 |      1 |      1 |
| Cincinnati | 331285 |      2 |      2 |
| Toledo     | 313619 |      2 |      2 |
| Akron      | 217074 |      3 |      3 |
| Dayton     | 166179 |      4 |      3 |
+------------+--------+--------+--------+
6 rows in set (0.00 sec)