Sunday, July 29, 2018

A Kind Introduction MySQL Windowing Functions Part I


Windowing functions are a critical tool for grouping rows of data that are related to other rows. But they go far beyond the regular aggregate functions found in MySQL 5.7 and earlier. In MySQL 8 you do not have to collapse all the information down into a single output row. Each row can retain its individual identity but the server can analyze the data as a unit.

Statistics and Damned Lies

Finding the total Population of the District Texas from the world.city table is simple. 

SQL> select District, sum(Population)  
from city where district = 'Texas';
+----------+-----------------+
| District | sum(Population) |
+----------+-----------------+
| Texas    |         9208281 |
+----------+-----------------+
1 row in set (0.0068 sec)






 Simple.  But try to expand it to the entire USA and you get problems.

SQL> select District, sum(Population)  
from city where CountryCode = 'USA';
+----------+-----------------+
| District | sum(Population) |
+----------+-----------------+
| New York |        78625774 |
+----------+-----------------+
1 row in set (0.0046 sec)


The results only give out the results for all the cities and lumps them under New York. This is not the desired answer. By the way the only time New York (and New York city in particular) has 78 million people is when they are all on the road in front of me when I am trying to take a taxi to an airport.

With a windowing function it is easy to iterate over a subset of the entire data.  Imagine reading the data through a page with a section cut out to form a window that is just the right size to read only the group of rows desired!

SQL> select name, District, Population, sum(Population) over() as p0, 
     District, sum(population) over( partition by District) as p1
     from city where CountryCode = 'USA' limit 10;
+------------+----------+------------+----------+----------+---------+
| name       | District | Population | p0       | District | p1      |
+------------+----------+------------+----------+----------+---------+
| Birmingham | Alabama  |     242820 | 78625774 | Alabama  |  801519 |
| Montgomery | Alabama  |     201568 | 78625774 | Alabama  |  801519 |
| Mobile     | Alabama  |     198915 | 78625774 | Alabama  |  801519 |
| Huntsville | Alabama  |     158216 | 78625774 | Alabama  |  801519 |
| Anchorage  | Alaska   |     260283 | 78625774 | Alaska   |  260283 |
| Phoenix    | Arizona  |    1321045 | 78625774 | Arizona  | 3178903 |
| Tucson     | Arizona  |     486699 | 78625774 | Arizona  | 3178903 |
| Mesa       | Arizona  |     396375 | 78625774 | Arizona  | 3178903 |
| Glendale   | Arizona  |     218812 | 78625774 | Arizona  | 3178903 |
| Scottsdale | Arizona  |     202705 | 78625774 | Arizona  | 3178903 |
+------------+----------+------------+----------+----------+---------+
10 rows in set (0.0075 sec)


The above query has two windows.  The keyword to notice is OVER().  The window defined is OVER() with nothing within the parenthesis as this should be under stood to mean 'the widow is open wide enough to see all the data'. So sum(Population) over() as p0 will give us the sum of all the Population columns and name the column p0.

The second window is defined as sum(population) over (partition by District) as p1 will provide all of the Population of each district summed in a column named p1.  

Indianapolis 500 winning Lotus driven by Jimmy Clark. Before this car all other winners of the race had the engine in front of the driver.  What does this have to do with Windowing Functions, databases, SQL, or anything else in this blog post. Well, nothing but while at Detroit PHP I stopped by the Henry Ford Museum and took this picture of a very icon car that changed Indy forever.  Windowing Functions can change your SQL skills.

Different Types of Windows

The OVER clause has two forms - window_spec and window_name.  The window_spec option declares the specified window with the parenthesis.  While window_name is a window defined elsewhere in the query.  The send window in the above query  (where the output is named p1) is a window_spec.

So here is a window_name example:

SQL> SELECT District, Sum(Population) OVER w 
    FROM city where CountryCode ='USA' 
    WINDOW w AS (partition by District) limit 10;
+----------+------------------------+
| District | Sum(Population) OVER w |
+----------+------------------------+
| Alabama  |                 801519 |
| Alabama  |                 801519 |
| Alabama  |                 801519 |
| Alabama  |                 801519 |
| Alaska   |                 260283 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
+----------+------------------------+
10 rows in set (0.0032 sec)


The window was given the name w and then defined as WINDOW w AS (partition by District).   By the way the declaration within the window_name is itself as window_spec.

So what is a window_spec??

 

The definition from the manual (https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html) informs that a window_spec is defined as:

window_spec: [window_name] [partition_clause] [order_clause] [frame_clause]

The window_name is an alias that can be used elsewhere in the query.

The partition_clause is how the data is to be divided up into groups. If this is unspecified it makes one big group.  

The order_clause provides the sort_order. Remember the ORDER BY from regular SQL queries?  This is how you can order the groups.

And the frame_clause determines sub groups within the big group.

And as it so often happens, there is a lot of material to be covered under the frame_clause and that will be in a future blog.