I had an question that made me stop in my tracks -- What Is The Difference Between GROUP BY and Windowing Functions? At one level there is a big difference and at another they are similar, at least in a few respects.
Definitions
GROUP BY is described as "The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country". Another source says Window functions belong to a type of function known as a ‘set function’, which means a function that applies to a set of rows. The word ‘window’ is used to refer to the set of rows that the function works on.
So both work on sets of rows. But what are the differences?
Lets start with a goal and for the first example we want to know how many cities are in each District (or state) in the world.city table where the CountryCode = 'USA'. And lets write one query that uses GROUP BY and one query that uses a Windowing Function. This query will be expanded to include more data just like queries expanded by the 'hey, this is nice but can you also get me ...' method.
Example Queries
from city
where CountryCode='usa'
group by District;
select distinct count(ID) over (partition by District),
District
from city
where CountryCode='USA';
Which Query to Use?!?!
I have always used the Occum's Razor rule when confronted with two equivalent queries and picked the simpler of the two. But while visually these took look similar the performance of the GROUP BY is appreciably better.
WITH ROLLUP or OVER()??
So lets make the query work a little harder. Now we want a total off all the IDs counted and with the GROUP BY version we add WITH ROLLUP.
from city
where CountryCode='usa'
group by District WITH ROLLUP;
And at the very end of the output we see the number 274 (the number of rows) and NULL as the name of the district used for the rollup. And yes you are not only stuck with the district name of NULL but explaining that it is a place holder to those wanting this data.
The Windowing Function version's output is not as pretty as it adds a repeating column of '274' to each row of output. The difference is now a selection of count(id) over() to the query.
from city
where CountryCode='USA';
A bit of explanation is needed here. There are now two Windowing Functions being used in this query. The first count(ID) is partitioned by the District column while the second, with the empty parens, is not partitioned -- it is counting all the records.
Now Add the Population
Now let's us pretend we like our queries but now we want to add the sum of the Population column for each district.
District,
sum(Population) as Pop
from city
where CountryCode='usa'
group by District, Population with rollup
Please note that I am limit the output to ten rows to save the length of this blog from the horrendous level. Here the GROUP BY has prioritized Population over District and the results are not exactly usable. We are getting the population rolled up for each state but it is probably not obvious at first glance.
GROUP BY can prioritize what we group the data by but as you see above it is limited by the mechanics of its definition. So we can very quickly outgrow GROUP BY.
Defenestration
Windowing Function let you have a little more control over the results. In the follow example we count the number of IDs per each District and sum the population for each District.
| count | District | Pop |
+-------+----------------------+----------+
| 4 | Alabama | 801519 |
| 1 | Alaska | 260283 |
| 9 | Arizona | 3178903 |
| 1 | Arkansas | 183133 |
| 68 | California | 16716706 |
| 9 | Colorado | 1851149 |
| 5 | Connecticut | 609087 |
| 1 | District of Columbia | 572059 |
| 15 | Florida | 3151408 |
| 6 | Georgia | 1148875 |
+-------+----------------------+----------+
But let us expand the query to include two countries with the goal of getting the per district and per country information included.
+-------+-------------+--------------------+---------------+---------+
| count | CountryCode | Cities in District | District | Pop |
+-------+-------------+--------------------+---------------+---------+
| 1 | BEL | 1 | Antwerpen | 446525 |
| 2 | BEL | 2 | Bryssel | 239551 |
| 1 | BEL | 1 | East Flanderi | 224180 |
| 2 | BEL | 2 | Hainaut | 291762 |
| 1 | BEL | 1 | Liège | 185639 |
| 1 | BEL | 1 | Namur | 105419 |
| 1 | BEL | 1 | West Flanderi | 116246 |
| 4 | USA | 4 | Alabama | 801519 |
| 1 | USA | 1 | Alaska | 260283 |
| 9 | USA | 9 | Arizona | 3178903 |
+-------+-------------+--------------------+---------------+---------+
Now we are partitioning by District over three separate pieces of data. If we need to add another column for the sum of the population of each country, it is simple addition to the query. We do have to specify a new partitioning factor as now we need to group the rows for the new calculation accordingly. Unlike using GROUP BY, there is no ambiguity on what is being prioritized. And we are getting the data the way we want.
+-------+-------------+---+---------------+----------+-------------+
| count | CountryCode | # | District | Pop City | Pop Country |
+-------+-------------+---+---------------+----------+-------------+
| 1 | BEL | 1 | Antwerpen | 446525 | 1609322 |
| 2 | BEL | 2 | Bryssel | 239551 | 1609322 |
| 1 | BEL | 1 | East Flanderi | 224180 | 1609322 |
| 2 | BEL | 2 | Hainaut | 291762 | 1609322 |
| 1 | BEL | 1 | Liège | 185639 | 1609322 |
| 1 | BEL | 1 | Namur | 105419 | 1609322 |
| 1 | BEL | 1 | West Flanderi | 116246 | 1609322 |
| 4 | USA | 4 | Alabama | 801519 | 78625774 |
| 1 | USA | 1 | Alaska | 260283 | 78625774 |
| 9 | USA | 9 | Arizona | 3178903 | 78625774 |
+-------+-------------+---+---------------+----------+-------------+
This is just scratching the surface of how and when to use Windowing Functions over GROUP BY and just skims the surface of the power of Windowing Functions (and yes, you can neaten things up from the examples). I hope to show in a future post some more of this power.