Wednesday, August 19, 2020

What Is The Difference Between GROUP BY and Windowing Functions? PART I

    I had an question that made me stop in my tracks -- What Is The Difference Between GROUP BY and Windowing Functions?  At one level there is a big difference and at another they are similar, at least in a few respects.

Definitions

    GROUP BY is described as "The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country". Another source says Window functions belong to a type of function known as a ‘set function’, which means a function that applies to a set of rows. The word ‘window’ is used to refer to the set of rows that the function works on.

    So both work on sets of rows. But what are the differences?

    Lets start with a goal and for the first example we want to know how many cities are in each District (or state) in the world.city table where the CountryCode = 'USA'. And lets write one query that uses GROUP BY and one query that uses a Windowing Function.  This query will be expanded to include more data just like queries expanded by the 'hey, this is nice but can you also get me ...' method.

Example Queries

    An example using GROUP BY

select count(id), 
       District 
from city 
where CountryCode='usa' 
group by District;

    And an example using a Windowing Function.  If you are unfamiliar with windowing functions the key word to look for is OVER()

select distinct count(ID) over (partition by District)
    District  
 from city 
 where CountryCode='USA';

    Not too much of a difference. What about the query plan?  Using EXPLAIN FORMAT=JSON reveals the query cost is 102.4 for the GROUPY BY and 276.4 for the Windowing Function.  

 Which Query to Use?!?!

    I have always used the Occum's Razor rule when confronted with two equivalent queries and picked the simpler of the two. But while visually these took look similar the performance of the GROUP BY is appreciably better. 

WITH ROLLUP or OVER()??

    So lets make the query work a little harder.  Now we want a total off all the IDs counted and with the GROUP BY version we add WITH ROLLUP.

select count(id), 
        District 
from city 
where CountryCode='usa' 
group by District WITH ROLLUP;

    And at the very end of the output we see the number 274 (the number of rows) and NULL as the name of the district used for the rollup. And yes you are not only stuck with the district name of NULL but explaining that it is a place holder to those wanting this data.  

    The Windowing Function version's output is not as pretty as it adds a repeating column of '274' to each row of output. The difference is now a selection of count(id) over() to the query.

select distinct count(ID) over (partition by District), 
                count(ID) over()
District 
from city 
where CountryCode='USA';

    A bit of explanation is needed here.  There are now two Windowing Functions being used in this query.  The first count(ID) is partitioned by the District column while the second, with the empty parens, is not partitioned -- it is counting all the records.

Now Add the Population

    Now let's us pretend we like our queries but now we want to add the sum of the Population column for each district.

select count(id), 
        District, 
        sum(Population) as Pop 
       from city 
        where CountryCode='usa' 
        group by District, Population with rollup
                    limit 5;

    Please note that I am limit the output to ten rows to save the length of this blog from the horrendous level.  Here the GROUP BY has prioritized Population over District and the results are not exactly usable. We are getting the population rolled up for each state but it is probably not obvious at first glance.

+-----------+----------+--------+
| count(id) | District | Pop    |
+-----------+----------+--------+
|         1 | Alabama  | 158216 |
|         1 | Alabama  | 198915 |
|         1 | Alabama  | 201568 |
|         1 | Alabama  | 242820 |
|         4 | Alabama  | 801519 |
|         1 | Alaska   | 260283 |
|         1 | Alaska   | 260283 |
|         1 | Arizona  | 108364 |
|         1 | Arizona  | 109697 |
|         1 | Arizona  | 158625 |
+-----------+----------+--------+

  So lets switch the group by to 'GROUP BY Population, District' to prioritize Population. And what we get is not much better. The count of the cities in the individual districts is gone.  We are not getting the population sum by District with the ROLLUP but the output is not what we want.

+-----------+----------------+-------+
| count(id) | District       | Pop   |
+-----------+----------------+-------+
|         1 | South Carolina | 89063 |
|         1 | NULL           | 89063 |
|         1 | California     | 89089 |
|         1 | NULL           | 89089 |
|         1 | Texas          | 89293 |
|         1 | NULL           | 89293 |
|         1 | Illinois       | 89408 |
|         1 | NULL           | 89408 |
|         1 | Wisconsin      | 89447 |
|         1 | NULL           | 89447 |
+-----------+----------------+-------+

       GROUP BY can prioritize what we group the data by but as you see above it is limited by the mechanics of its definition. So we can very quickly outgrow GROUP BY.

Defenestration

    Windowing Function let you have a little more control over the results.  In the follow example we count the number of IDs per each District and sum the population for each District.      

select distinct count(ID) over (partition by District) as count,            District, 
        SUM(Population) over (partition by District)  as 'Pop' 
from city 
where CountryCode='USA' 
limit 10;
+-------+----------------------+----------+
| count | District             | Pop      |
+-------+----------------------+----------+
|     4 | Alabama              |   801519 |
|     1 | Alaska               |   260283 |
|     9 | Arizona              |  3178903 |
|     1 | Arkansas             |   183133 |
|    68 | California           | 16716706 |
|     9 | Colorado             |  1851149 |
|     5 | Connecticut          |   609087 |
|     1 | District of Columbia |   572059 |
|    15 | Florida              |  3151408 |
|     6 | Georgia              |  1148875 |
+-------+----------------------+----------+


    But let us expand the query to include two countries with the goal of getting the per district and per country information included.

select distinct count(ID) over (partition by District) as count
        CountryCode, 
        count(CountryCode) over (partition by District) 
               as 'Cities in District',             
        District, 
        SUM(Population) over (partition by District)  as 'Pop' 
from city 
where CountryCode IN ('USA','BEL') 
order by CountryCode,District 
limit 10;
+-------+-------------+--------------------+---------------+---------+
| count | CountryCode | Cities in District | District      | Pop     |
+-------+-------------+--------------------+---------------+---------+
|     1 | BEL         |                  1 | Antwerpen     |  446525 |
|     2 | BEL         |                  2 | Bryssel       |  239551 |
|     1 | BEL         |                  1 | East Flanderi |  224180 |
|     2 | BEL         |                  2 | Hainaut       |  291762 |
|     1 | BEL         |                  1 | Liège        |  185639 |
|     1 | BEL         |                  1 | Namur         |  105419 |
|     1 | BEL         |                  1 | West Flanderi |  116246 |
|     4 | USA         |                  4 | Alabama       |  801519 |
|     1 | USA         |                  1 | Alaska        |  260283 |
|     9 | USA         |                  9 | Arizona       | 3178903 |
+-------+-------------+--------------------+---------------+---------+

    Now we are partitioning by District over three separate pieces of data.  If we need to add another column for the sum of the population of each country, it is simple addition to the query. We do have to specify a new partitioning factor as now we need to group the rows for the new calculation accordingly.  Unlike using GROUP BY, there is no ambiguity on what is being prioritized. And we are getting the data the way we want.

select distinct count(ID) over (partition by District) as count,            CountryCode,     
        count(CountryCode) over (partition by District) as '#',            District, 
        SUM(Population) over (partition by District)  as 'Pop City',         SUM(Population) over (partition by CountryCode) 
            as 'Pop Country' 
from city 
where CountryCode IN ('USA','BEL') 
order by CountryCode,District 
limit 10;
+-------+-------------+---+---------------+----------+-------------+
| count | CountryCode | # | District      | Pop City | Pop Country |
+-------+-------------+---+---------------+----------+-------------+
|     1 | BEL         | 1 | Antwerpen     |   446525 |     1609322 |
|     2 | BEL         | 2 | Bryssel       |   239551 |     1609322 |
|     1 | BEL         | 1 | East Flanderi |   224180 |     1609322 |
|     2 | BEL         | 2 | Hainaut       |   291762 |     1609322 |
|     1 | BEL         | 1 | Liège        |   185639 |     1609322 |
|     1 | BEL         | 1 | Namur         |   105419 |     1609322 |
|     1 | BEL         | 1 | West Flanderi |   116246 |     1609322 |
|     4 | USA         | 4 | Alabama       |   801519 |    78625774 |
|     1 | USA         | 1 | Alaska        |   260283 |    78625774 |
|     9 | USA         | 9 | Arizona       |  3178903 |    78625774 |
+-------+-------------+---+---------------+----------+-------------+

    This is just scratching the surface of how and when to use Windowing Functions over GROUP BY and just skims the surface of the power of Windowing Functions (and yes, you can neaten things up from the examples). I hope to show in a future post some more of this power.