Wednesday, August 1, 2018

A Nice Introduction to MySQL Window Functions III

Windowing Functions can get quite complex very quickly when you start taking advantage of the frame clause. Ranges and rows can get confusing.  So for review lets look at how the specification looks:

Window_spec:
   [window name] [partition clause] [order clause] [frame clause]

That looks simple. And them come terms like UNBOUNDED PRECEDING that could put a knot in your gut.  The manual is not exactly written to help novices in this area get up to speed.  But don't panic.  If you work through the examples that follow (and please do the preceding part of this series before trying these examples) you will have a better appreciation of what is going on with window function.

The Frame Clause


So the frame clause is optional in the window function.  A frame is considered a subset of the current partition and defines that subset.  Frames are determined with respect to the current row. This allows grouping of data within a partition depending on the current row in the partition.  If the frame is defined as all rows  from the start of the end of the partition you can computer running totals for each row.  Or the frame can be defined as extending a desired number of rows  either side of the current row which lets you compute rolling averages.

The first example in this series had the window defined as W OVER() which does not have a frame clause and computes over the entire column.

mysql> select x, 
              sum(x) over() from x;
+------+---------------+
| x    | sum(x) over() |
+------+---------------+
|    1 |            55 |
|    2 |            55 |
|    3 |            55 |
|    4 |            55 |
|    5 |            55 |
|    6 |            55 |
|    7 |            55 |
|    8 |            55 |
|    9 |            55 |
|   10 |            55 |
+------+---------------+
10 rows in set (0.00 sec)

Unbounded

There is usually more than one way to do things. OVER() can also be written as WINDOW w AS (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) or WINDOW w AS (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).

Uh, wha, huh?

Besides the confusion of windows/frames and range/rows, the wording gets tricky.   UNBOUNDED should be taken to mean as 'everything' so UNBOUNDED BEFORE means everything before and UNBOUNDED AFTER means everything after.



mysql> SELECT x, 
             sum(x) over() as 'over', 
             sum(x) OVER a as 'row',
             sum(x) OVER b AS 'range' 
             FROM x 
             window 
a AS (rows between unbounded preceding and unbounded following),    b AS (range between unbounded preceding and unbounded following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |   55 |    55 |
|    2 |   55 |   55 |    55 |
|    3 |   55 |   55 |    55 |
|    4 |   55 |   55 |    55 |
|    5 |   55 |   55 |    55 |
|    6 |   55 |   55 |    55 |
|    7 |   55 |   55 |    55 |
|    8 |   55 |   55 |    55 |
|    9 |   55 |   55 |    55 |
|   10 |   55 |   55 |    55 |
+------+------+------+-------+
10 rows in set (0.01 sec)

So in the above example the windows are framed so that all the rows are used for the summation of the column when using rows and range.

ROW vs RANGE

Right now you are probably wondering what the difference is between RANGE and ROW.  Let's modify the last example slightly. So we replace between current row and unbounded following as the core of the frame specification.

mysql> SELECT x, 
         sum(x) over() as 'over', 
         sum(x) OVER a as 'row', 
         sum(x) OVER b AS 'range' 
         FROM x 
  window a AS (rows between current row and unbounded following), 
         b AS (range between current row and unbounded following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |   55 |    55 |
|    2 |   55 |   54 |    55 |
|    3 |   55 |   52 |    55 |
|    4 |   55 |   49 |    55 |
|    5 |   55 |   45 |    55 |
|    6 |   55 |   40 |    55 |
|    7 |   55 |   34 |    55 |
|    8 |   55 |   27 |    55 |
|    9 |   55 |   19 |    55 |
|   10 |   55 |   10 |    55 |
+------+------+------+-------+
10 rows in set (0.00 sec)

The row column now counts down while the range column is unchanged.  What happened? So rows have a frame defined by the beginning and ending row position.  The first row has a sum of all ten items in the column. But the second row has a sum starting with the value of 2 and ending at 10, the third row sums 3 to 10, etcetera.

For range, the frame is defined by rows within a value range of an ORDER BY clause.  But where is the ORDER BY??  Well in this case it is implied to be the entire partition so it takes all the column values.

Range gets a little more differentiated when you use it a little more creatively.

mysql> SELECT x,
             sum(x) over() as 'over', 
             sum(x) OVER a as 'row', 
             sum(x) OVER b AS 'range' 
             FROM x 
             window 
               a AS (rows between current row and 1 following), 
               b AS (order by x 
                     range between current row and 1 following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |    3 |     3 |
|    2 |   55 |    5 |     5 |
|    3 |   55 |    7 |     7 |
|    4 |   55 |    9 |     9 |
|    5 |   55 |   11 |    11 |
|    6 |   55 |   13 |    13 |
|    7 |   55 |   15 |    15 |
|    8 |   55 |   17 |    17 |
|    9 |   55 |   19 |    19 |
|   10 |   55 |   10 |    10 |
+------+------+------+-------+
10 rows in set (0.06 sec)

The above will sum the current row and the next row.  With row the frame becomes rows between current row and 1 following.  But this time for range the order has to be specified and the frame becomes  order by x range between current row and 1 following. 

It is possible to specify bigger ranges.  

mysql> SELECT x, 
          sum(x) over() as 'over', 
           sum(x) OVER a as 'row', 
           sum(x) OVER b AS 'range' 
          FROM x 
      window a AS (rows between 1 preceding and 3 following), 
        b AS (order by x range between 1 preceding and 3 following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |   10 |    10 |
|    2 |   55 |   15 |    15 |
|    3 |   55 |   20 |    20 |
|    4 |   55 |   25 |    25 |
|    5 |   55 |   30 |    30 |
|    6 |   55 |   35 |    35 |
|    7 |   55 |   40 |    40 |
|    8 |   55 |   34 |    34 |
|    9 |   55 |   27 |    27 |
|   10 |   55 |   19 |    19 |
+------+------+------+-------+
10 rows in set (0.00 sec)


Hopefully this takes a little bit of the mystery out of windowing functions for novices.

Next time I hope to look into some time series data and windowing functions.