Monday, January 29, 2018

MySQL 8 Windowing Functions

Windowing functions are way to group rows of data for analysis. This provides a 'window' to look at only the relevant data only.  Plus there are about a dozen supporting functions to take advantage of all this. There is some carry over logically from the aggregate (or group by) functions but they open up some easy ways to dig statistically into your data.

Now for a contrived example.


mysql> Select row_number() over w as '#', 
       Name, Continent, Population, 
       sum(Population) over w as 'Pop' 
       from country where continent='South America' 
       window w as (partition by Continent Order by Continent);                                                                                
+----+------------------+---------------+------------+-----------+
| #  | Name             | Continent     | Population | Pop       |
+----+------------------+---------------+------------+-----------+
|  1 | Argentina        | South America |   37032000 | 345780000 |
|  2 | Bolivia          | South America |    8329000 | 345780000 |
|  3 | Brazil           | South America |  170115000 | 345780000 |
|  4 | Chile            | South America |   15211000 | 345780000 |
|  5 | Colombia         | South America |   42321000 | 345780000 |
|  6 | Ecuador          | South America |   12646000 | 345780000 |
|  7 | Falkland Islands | South America |       2000 | 345780000 |
|  8 | French Guiana    | South America |     181000 | 345780000 |
|  9 | Guyana           | South America |     861000 | 345780000 |
| 10 | Peru             | South America |   25662000 | 345780000 |
| 11 | Paraguay         | South America |    5496000 | 345780000 |
| 12 | Suriname         | South America |     417000 | 345780000 |
| 13 | Uruguay          | South America |    3337000 | 345780000 |
| 14 | Venezuela        | South America |   24170000 | 345780000 |
+----+------------------+---------------+------------+-----------+
14 rows in set (0.00 sec)


In the above example, we created a window named 'w' to allows us to 'peek' at the data arranged by Continent. I am cheating here by only looking at the Continent of South America. You will get a much better look at all this if you remove the WHERE continent='South America' from the query but I use it here for brevity.

And I used two Windows functions on the data from that window.  The ROW_NUMBER()  function provides a nice way to do exactly what the name of this function says.  And the SUM() function adds up all the population columns.

We can even set up ranks, or buckets, to divide up the data.



mysql> Select row_number() over w as '#', Name, Population, 
      ntile(5) over w as 'tile', sum(Population) over w as 'Pop' 
      from country where continent='South America' 
      window w as (partition by Continent Order by Continent);
+----+------------------+------------+------+-----------+
| #  | Name             | Population | tile | Pop       |
+----+------------------+------------+------+-----------+
|  1 | Argentina        |   37032000 |    1 | 345780000 |
|  2 | Bolivia          |    8329000 |    1 | 345780000 |
|  3 | Brazil           |  170115000 |    1 | 345780000 |
|  4 | Chile            |   15211000 |    2 | 345780000 |
|  5 | Colombia         |   42321000 |    2 | 345780000 |
|  6 | Ecuador          |   12646000 |    2 | 345780000 |
|  7 | Falkland Islands |       2000 |    3 | 345780000 |
|  8 | French Guiana    |     181000 |    3 | 345780000 |
|  9 | Guyana           |     861000 |    3 | 345780000 |
| 10 | Peru             |   25662000 |    4 | 345780000 |
| 11 | Paraguay         |    5496000 |    4 | 345780000 |
| 12 | Suriname         |     417000 |    4 | 345780000 |
| 13 | Uruguay          |    3337000 |    5 | 345780000 |
| 14 | Venezuela        |   24170000 |    5 | 345780000 |
+----+------------------+------------+------+-----------+
14 rows in set (0.00 sec)
mysql>

Here we used NTILE(5) to divide the results into five bucks.  By the way change the 5 to a 4 and you have quartiles or 100 for percentiles.  This does not really mean much statistically since the countries are arranged alphabetically.

So lets make this a little more statistically meaningful. Lets look at the population of South America with the largest countries first and broken into quartiles.


mysql> Select row_number() over w as '#', Name, Population, ntile(4) over w as 'tile', 
       sum(Population) over w as 'Pop' 
       from country where continent='South America' 
       window w as (partition by Continent Order by Population desc);
+----+------------------+------------+------+-----------+
| #  | Name             | Population | tile | Pop       |
+----+------------------+------------+------+-----------+
|  1 | Brazil           |  170115000 |    1 | 170115000 |
|  2 | Colombia         |   42321000 |    1 | 212436000 |
|  3 | Argentina        |   37032000 |    1 | 249468000 |
|  4 | Peru             |   25662000 |    1 | 275130000 |
|  5 | Venezuela        |   24170000 |    2 | 299300000 |
|  6 | Chile            |   15211000 |    2 | 314511000 |
|  7 | Ecuador          |   12646000 |    2 | 327157000 |
|  8 | Bolivia          |    8329000 |    2 | 335486000 |
|  9 | Paraguay         |    5496000 |    3 | 340982000 |
| 10 | Uruguay          |    3337000 |    3 | 344319000 |
| 11 | Guyana           |     861000 |    3 | 345180000 |
| 12 | Suriname         |     417000 |    4 | 345597000 |
| 13 | French Guiana    |     181000 |    4 | 345778000 |
| 14 | Falkland Islands |       2000 |    4 | 345780000 |
+----+------------------+------------+------+-----------+
14 rows in set (0.00 sec)

Now notice the Pop column as it suddenly becomes a very useful running total.

 I have only touched a few of the new functions to support Windowing functions but there is much more of interest here.