Open Source Databases: January 2018

Monday, January 29, 2018

MySQL 8 Windowing Functions

Windowing functions are a way to group rows of data for analysis. This provides a 'window' to look at only the relevant data. Plus there are about a dozen supporting functions to take advantage of all this. There is some logical carryover from the aggregate (or GROUP BY) functions, but they open up some easy ways to dig statistically into your data.

Now for a contrived example.


mysql> Select row_number() over w as '#', 
       Name, Continent, Population, 
       sum(Population) over w as 'Pop' 
       from country where continent='South America' 
       window w as (partition by Continent Order by Continent);
+----+------------------+---------------+------------+-----------+
| #  | Name             | Continent     | Population | Pop       |
+----+------------------+---------------+------------+-----------+
|  1 | Argentina        | South America |   37032000 | 345780000 |
|  2 | Bolivia          | South America |    8329000 | 345780000 |
|  3 | Brazil           | South America |  170115000 | 345780000 |
|  4 | Chile            | South America |   15211000 | 345780000 |
|  5 | Colombia         | South America |   42321000 | 345780000 |
|  6 | Ecuador          | South America |   12646000 | 345780000 |
|  7 | Falkland Islands | South America |       2000 | 345780000 |
|  8 | French Guiana    | South America |     181000 | 345780000 |
|  9 | Guyana           | South America |     861000 | 345780000 |
| 10 | Peru             | South America |   25662000 | 345780000 |
| 11 | Paraguay         | South America |    5496000 | 345780000 |
| 12 | Suriname         | South America |     417000 | 345780000 |
| 13 | Uruguay          | South America |    3337000 | 345780000 |
| 14 | Venezuela        | South America |   24170000 | 345780000 |
+----+------------------+---------------+------------+-----------+
14 rows in set (0.00 sec)

In the above example, we created a window named 'w' that allows us to 'peek' at the data arranged by Continent. I am cheating here by only looking at the continent of South America. You will get a much better look at all this if you remove the WHERE continent='South America' from the query, but I use it here for brevity.

And I used two window functions on the data from that window. The ROW_NUMBER() function does exactly what its name says. And the SUM() function adds up the Population column across all the rows in the window.
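To see what the query does without the WHERE clause, here is a minimal sketch that runs the same kind of statement through Python's built-in sqlite3 module instead of MySQL 8 (an assumption made so the example is self-contained; it needs SQLite 3.25 or later). The table contents, the continent names 'Alpha' and 'Beta', and the function name partition_totals are all made up for illustration.

```python
import sqlite3


def partition_totals():
    """Return (n, Name, Continent, Population, Pop) rows for a toy country table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE country (Name TEXT, Continent TEXT, Population INTEGER)"
    )
    # Hypothetical data: two made-up continents with small, easy-to-check numbers.
    conn.executemany(
        "INSERT INTO country VALUES (?, ?, ?)",
        [
            ("A1", "Alpha", 10),
            ("A2", "Alpha", 20),
            ("A3", "Alpha", 30),
            ("B1", "Beta", 5),
            ("B2", "Beta", 15),
        ],
    )
    rows = conn.execute(
        """
        SELECT row_number() OVER w AS n,
               Name, Continent, Population,
               sum(Population) OVER w AS Pop
        FROM country
        WINDOW w AS (PARTITION BY Continent ORDER BY Continent)
        ORDER BY Continent, Name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in partition_totals():
        print(row)
```

Each partition gets its own row numbering and its own total: every 'Alpha' row carries 60 and every 'Beta' row carries 20. Because all rows in a partition tie on the ORDER BY column, which row receives which number is not guaranteed.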

We can even set up ranks, or buckets, to divide up the data.


mysql> Select row_number() over w as '#', Name, Population,
      ntile(5) over w as 'tile', sum(Population) over w as 'Pop'
      from country where continent='South America'
      window w as (partition by Continent Order by Continent);
+----+------------------+------------+------+-----------+
| #  | Name             | Population | tile | Pop       |
+----+------------------+------------+------+-----------+
|  1 | Argentina        |   37032000 |    1 | 345780000 |
|  2 | Bolivia          |    8329000 |    1 | 345780000 |
|  3 | Brazil           |  170115000 |    1 | 345780000 |
|  4 | Chile            |   15211000 |    2 | 345780000 |
|  5 | Colombia         |   42321000 |    2 | 345780000 |
|  6 | Ecuador          |   12646000 |    2 | 345780000 |
|  7 | Falkland Islands |       2000 |    3 | 345780000 |
|  8 | French Guiana    |     181000 |    3 | 345780000 |
|  9 | Guyana           |     861000 |    3 | 345780000 |
| 10 | Peru             |   25662000 |    4 | 345780000 |
| 11 | Paraguay         |    5496000 |    4 | 345780000 |
| 12 | Suriname         |     417000 |    4 | 345780000 |
| 13 | Uruguay          |    3337000 |    5 | 345780000 |
| 14 | Venezuela        |   24170000 |    5 | 345780000 |
+----+------------------+------------+------+-----------+
14 rows in set (0.00 sec)

mysql>

Here we used NTILE(5) to divide the results into five buckets. By the way, change the 5 to a 4 and you have quartiles, or to 100 for percentiles. This does not really mean much statistically, since the countries are listed more or less alphabetically rather than by size.

So let's make this a little more statistically meaningful. Let's look at the population of South America with the largest countries first, broken into quartiles.
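How NTILE() shares out rows that do not divide evenly can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL 8 (an assumption; SQLite 3.25 or later is needed), and the table t and the helper ntile_sizes are hypothetical names invented for this illustration.

```python
import sqlite3
from collections import Counter


def ntile_sizes(row_count, buckets):
    """Return the number of rows that land in each NTILE bucket, in bucket order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO t (id) VALUES (?)",
        [(i,) for i in range(1, row_count + 1)],
    )
    # The bucket count is forced to an int before being placed in the SQL text.
    tiles = conn.execute(
        f"SELECT ntile({int(buckets)}) OVER (ORDER BY id) FROM t"
    ).fetchall()
    conn.close()
    counts = Counter(tile for (tile,) in tiles)
    return [counts[b] for b in sorted(counts)]


if __name__ == "__main__":
    print(ntile_sizes(14, 5))  # fourteen rows, five buckets
    print(ntile_sizes(14, 4))  # fourteen rows, quartiles
```

With fourteen rows, five buckets come out as 3, 3, 3, 3 and 2 rows, which matches the tile column above: the leftover rows go to the earliest buckets. Asking for more buckets than there are rows simply yields one row per bucket.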


mysql> Select row_number() over w as '#', Name, Population, ntile(4) over w as 'tile',
       sum(Population) over w as 'Pop'
       from country where continent='South America'
       window w as (partition by Continent Order by Population desc);
+----+------------------+------------+------+-----------+
| #  | Name             | Population | tile | Pop       |
+----+------------------+------------+------+-----------+
|  1 | Brazil           |  170115000 |    1 | 170115000 |
|  2 | Colombia         |   42321000 |    1 | 212436000 |
|  3 | Argentina        |   37032000 |    1 | 249468000 |
|  4 | Peru             |   25662000 |    1 | 275130000 |
|  5 | Venezuela        |   24170000 |    2 | 299300000 |
|  6 | Chile            |   15211000 |    2 | 314511000 |
|  7 | Ecuador          |   12646000 |    2 | 327157000 |
|  8 | Bolivia          |    8329000 |    2 | 335486000 |
|  9 | Paraguay         |    5496000 |    3 | 340982000 |
| 10 | Uruguay          |    3337000 |    3 | 344319000 |
| 11 | Guyana           |     861000 |    3 | 345180000 |
| 12 | Suriname         |     417000 |    4 | 345597000 |
| 13 | French Guiana    |     181000 |    4 | 345778000 |
| 14 | Falkland Islands |       2000 |    4 | 345780000 |
+----+------------------+------------+------+-----------+
14 rows in set (0.00 sec)

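Now notice that the Pop column has suddenly become a very useful running total. Once the window is ordered by a column whose values differ from row to row, SUM() by default adds up only the rows from the start of the partition through the current row. In the earlier queries every row had the same Continent value, so all the rows were peers and each one received the full total.

I have only touched on a few of the new functions that support windowing, but there is much more of interest here.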
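The difference between a running total and a grand total comes down to the window frame. The sketch below is a rough illustration using Python's built-in sqlite3 module as a stand-in for MySQL 8 (an assumption; SQLite 3.25 or later is needed). The three population figures are copied from the output above; the function name running_and_total is hypothetical.

```python
import sqlite3


def running_and_total():
    """Return (Name, Population, running, total) ordered by Population descending."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE country (Name TEXT, Population INTEGER)")
    # Three rows taken from the South America result set shown earlier.
    conn.executemany(
        "INSERT INTO country VALUES (?, ?)",
        [
            ("Argentina", 37032000),
            ("Brazil", 170115000),
            ("Colombia", 42321000),
        ],
    )
    rows = conn.execute(
        """
        SELECT Name, Population,
               -- default frame: start of partition through the current row
               sum(Population) OVER (ORDER BY Population DESC) AS running,
               -- explicit frame: every row in the partition
               sum(Population) OVER (
                   ORDER BY Population DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               ) AS total
        FROM country
        ORDER BY Population DESC
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in running_and_total():
        print(row)
```

The running column climbs 170115000, 212436000, 249468000, the same first three values as in the Pop column above, while the total column holds 249468000 on every row because its frame is spelled out to cover the whole partition.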

Monday, January 8, 2018

Common Table Expressions (CTEs) Part 1

Occasionally at a conference or a Meetup, someone will approach me and ask for help with a MySQL problem. Eight out of ten times their difficulty involves a subquery. "I get an error message about a corrugated or conflabugated sub query or some such," they say, desperate for help. Usually with a bit of fumbling we can get their problem solved. The problem is not a lack of knowledge on either side but that subqueries are often hard to write.

MySQL 8 will be the first version of the most popular database on the web with Common Table Expressions or CTEs. A CTE is a way to create a named temporary table that lasts only for a single statement and then use that temporary table in the query. Think of them as easy-to-write subqueries!

WITH is The Magic Word

The new CTE magic is indicated with the WITH clause.

mysql> WITH myfirstCTE 
      AS (SELECT * FROM world.city WHERE CountryCode='USA')
SELECT Name, District
FROM myfirstCTE 
ORDER BY Name
LIMIT 5;
+-------------+------------+
| Name        | District   |
+-------------+------------+
| Abilene     | Texas      |
| Akron       | Ohio       |
| Albany      | New York   |
| Albuquerque | New Mexico |
| Alexandria  | Virginia   |
+-------------+------------+
5 rows in set (0.01 sec)

mysql>

So in the above example we create a temporary table named 'myfirstCTE' and then query from that table. The CTE itself is a very simple query, SELECT * FROM world.city WHERE CountryCode='USA'. Then two columns are plucked out of 'myfirstCTE'. The CTE can even have modifiers on the query such as ORDER BY, LIMIT, or any other SQL qualifier.
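To make the "easy to write subquery" comparison concrete, here is a small sketch that runs the CTE next to an equivalent derived-table subquery. It uses Python's built-in sqlite3 module rather than MySQL 8 (an assumption made to keep the example self-contained), a four-row cut-down city table built from rows that appear in the outputs on this page, and a hypothetical function name, cte_vs_subquery.

```python
import sqlite3


def cte_vs_subquery():
    """Run the same lookup as a CTE and as a derived table; return both results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE city (Name TEXT, CountryCode TEXT, District TEXT)")
    # A tiny stand-in for world.city.
    conn.executemany(
        "INSERT INTO city VALUES (?, ?, ?)",
        [
            ("Akron", "USA", "Ohio"),
            ("Kabul", "AFG", "Kabol"),
            ("Abilene", "USA", "Texas"),
            ("Albany", "USA", "New York"),
        ],
    )
    with_cte = conn.execute(
        """
        WITH myfirstCTE AS (SELECT * FROM city WHERE CountryCode = 'USA')
        SELECT Name, District FROM myfirstCTE ORDER BY Name LIMIT 5
        """
    ).fetchall()
    with_subquery = conn.execute(
        """
        SELECT Name, District
        FROM (SELECT * FROM city WHERE CountryCode = 'USA') AS sub
        ORDER BY Name LIMIT 5
        """
    ).fetchall()
    conn.close()
    return with_cte, with_subquery


if __name__ == "__main__":
    cte_rows, sub_rows = cte_vs_subquery()
    print(cte_rows)
    print(cte_rows == sub_rows)
```

Both statements return the same rows. The CTE version names the intermediate result up front, so the main SELECT reads top to bottom instead of inside out.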

A Little More Complex Example

Let's try a slightly more complex query.

mysql> WITH secondCTE AS 
     (SELECT city.Name AS A, country.Name as B, city.District AS C 
     FROM city 
     JOIN country on (city.CountryCode=country.Code)) 
     SELECT A, B, C FROM secondCTE LIMIT 5;
+----------------+-------------+----------+
| A              | B           | C        |
+----------------+-------------+----------+
| Oranjestad     | Aruba       | –        |
| Kabul          | Afghanistan | Kabol    |
| Qandahar       | Afghanistan | Qandahar |
| Herat          | Afghanistan | Herat    |
| Mazar-e-Sharif | Afghanistan | Balkh    |
+----------------+-------------+----------+
5 rows in set (0.04 sec)

mysql>

Here we select columns A, B, C from temporary table 'secondCTE'. In this case it is easier to see that we are getting data from the CTE and not the two tables city or country. At least directly.

The reason I use this example is that you often have to join two or more tables in which columns with the same name have different sources, here the name of the country and the name of the city. Yes, you could alias them in a subquery, but the legibility of the query increases with a CTE. And as query complexity builds, things like the old N+1 problem raise their ugly heads.

More Than One Way to Skin a Cat

So let's play again with the information from the second example, but this time we will write two separate CTEs and then join them together. First we get the Name column from the city table and the Name column from the country table. We do have to drag along the join columns in the CTEs. I think that for many with limited SQL experience this version is a little easier to comprehend than the second example.

mysql> WITH threeA AS (SELECT Name AS N1, CountryCode AS x FROM city),
    -> threeB AS (SELECT Name AS N2, Code as y FROM country)
    -> SELECT N1, N2 FROM threeA JOIN threeB WHERE threeA.x = threeB.y LIMIT 5;
+----------------+-------------+
| N1             | N2          |
+----------------+-------------+
| Oranjestad     | Aruba       |
| Kabul          | Afghanistan |
| Qandahar       | Afghanistan |
| Herat          | Afghanistan |
| Mazar-e-Sharif | Afghanistan |
+----------------+-------------+
5 rows in set (0.01 sec)

mysql>
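The same two-CTE pattern can also be written with the join condition in an ON clause rather than in WHERE. The sketch below is one way to try that, run through Python's built-in sqlite3 module as a stand-in for MySQL 8 (an assumption). The miniature city and country tables reuse names from the output above, the country codes 'ABW' and 'AFG' are assumed to match the world sample database, and two_ctes_joined is a hypothetical function name.

```python
import sqlite3


def two_ctes_joined():
    """Join two CTEs on the country code and return (city name, country name) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE city (Name TEXT, CountryCode TEXT)")
    conn.execute("CREATE TABLE country (Name TEXT, Code TEXT)")
    conn.executemany(
        "INSERT INTO city VALUES (?, ?)",
        [("Oranjestad", "ABW"), ("Kabul", "AFG"), ("Herat", "AFG")],
    )
    conn.executemany(
        "INSERT INTO country VALUES (?, ?)",
        [("Aruba", "ABW"), ("Afghanistan", "AFG")],
    )
    rows = conn.execute(
        """
        WITH threeA AS (SELECT Name AS N1, CountryCode AS x FROM city),
             threeB AS (SELECT Name AS N2, Code AS y FROM country)
        SELECT N1, N2
        FROM threeA
        JOIN threeB ON threeA.x = threeB.y
        ORDER BY N1
        LIMIT 5
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in two_ctes_joined():
        print(row)
```

An ORDER BY is added here so the result order is predictable; without it, LIMIT 5 returns whichever five rows the engine produces first.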

Hopefully the use of CTEs will produce easier to read SQL statements.

Tuesday, January 2, 2018

Two New MySQL Books!

There are two new MySQL books, both from Apress. One is an in-depth master course on the subject and the other is a quick introduction.

Pro MySQL NDB Cluster is subtitled Master the MySQL Cluster Lifecycle, and at nearly 700 pages it is a vital resource for anyone who runs or is thinking about running NDB Cluster. The authors, Jesper Wisborg Krogh and Mikiya Okuno, have distilled their vast knowledge of this difficult subject into a detail-packed but easily readable book. MySQL Cluster is much more complex in many areas than a regular MySQL server, and here you will find all those details. If you run MySQL NDB Cluster then you need this book. The partitioning information in chapter 2 is worth the price of the book alone. I am only a third of the way through the book and have found it as clear and concise as any technical book I have read; it is actually an easy read.

MariaDB and MySQL Common Table Expressions and Window Functions Revealed by Daniel Bartholomew is a slender introduction to CTEs and Window functions. If you were raised on MySQL and do not know either subject well, then I highly recommend this book. CTEs are going to have a big impact on the way developers write subqueries and may cause self joins to become extinct. Windowing functions will bring a new wave of analytics to MySQL. This book is just over 100 pages and has useful examples for novices in either area.

More books! Well yes, there are more MySQL books in the works so save your pocket change in order to buy them when they appear.