Tuesday, July 31, 2018

A Kind Introduction to MySQL Windowing Functions II

Before I take up from the last blog, I need to introduce RANGE and ROWs.  Windows over data can be framed and this is where things can get wild and woolly.   Table x has a column named x (me being overly creative again) that has the values one through 10.  If we sum the values of x we can get different values depending on how the frame is constructed.

If the frame for the window is defined as a the range between 'unbounded preceding and current row' the value for the sum of x will the sum for the entire column of data.  However, if the frame is defined as the rows between 'unbounded preceding and current row' it will sum up the values of the current row and the values of the rows that came before; 1, 1+2, 1+2+3. etc.


mysql> SELECT x, sum(x) over w as 'rows',
    -> sum(x) over y as 'range'
    -> from x
    -> window w as 
           (rows between unbounded preceding and current row),
    ->        y as 
           (range between unbounded preceding and current row);
+------+------+-------+
| x    | rows | range |
+------+------+-------+
|    1 |    1 |    55 |
|    2 |    3 |    55 |
|    3 |    6 |    55 |
|    4 |   10 |    55 |
|    5 |   15 |    55 |
|    6 |   21 |    55 |
|    7 |   28 |    55 |
|    8 |   36 |    55 |
|    9 |   45 |    55 |
|   10 |   55 |    55 |
+------+------+-------+
10 rows in set (0.00 sec)

The terminology of frames and windows gets a little confusing and sound like a home remodeling project.  For now consider 'unbounded' to be everything in the column, so unbounded preceding would be everything in the column before this row. 


mysql> SELECT x, 
       sum(x) over w as 'rows', 
       sum(x) over y as 'range' 
       from x 
       window w as 
         (rows between current row and unbounded following), 
         y as (range between unbounded preceding and current row);
+------+------+-------+
| x    | rows | range |
+------+------+-------+
|    1 |   55 |    55 |
|    2 |   54 |    55 |
|    3 |   52 |    55 |
|    4 |   49 |    55 |
|    5 |   45 |    55 |
|    6 |   40 |    55 |
|    7 |   34 |    55 |
|    8 |   27 |    55 |
|    9 |   19 |    55 |
|   10 |   10 |    55 |
+------+------+-------+
10 rows in set (0.00 sec)

So unbounded following would take all the values in the column after the current.  So the rows value where x is equal to 2 of 54 is the grand total sum of 55 less the value of the first row or 54 (and x =3 is 55 less 1 & 2).

More frames and windows next time!



More State Stuff


Last  time was the start of a series on windowing functions that were introduced with MySQL 8. This entry will delve a little bit deeper into this realm.  The data used for these examples starts off with the World database.

Population By State


Previously the data showed a total population for each state but listed an entry for each city in that state, er, district.  It is quite easy to get a per state total by adding a simple GROUP BY to the query.


mysql> SELECT district, Sum(Population) OVER w 
FROM city 
WHERE CountryCode = 'USA' 
GROUP BY District  
WINDOW w AS (PARTITION BY District) 
LIMIT 10;
+----------------------+------------------------+
| district             | Sum(Population) OVER w |
+----------------------+------------------------+
| Alabama              |                 242820 |
| Alaska               |                 260283 |
| Arizona              |                1321045 |
| Arkansas             |                 183133 |
| California           |                3694820 |
| Colorado             |                 554636 |
| Connecticut          |                 139529 |
| District of Columbia |                 572059 |
| Florida              |                 735167 |
| Georgia              |                 416474 |
+----------------------+------------------------+
10 rows in set (0.00 sec)


Going Back to Ohio


But what if we wanted a little more statistical information on a state's population?  Well, there several functions just waiting for you.

mysql> select name, Population as 'Pop', 
       SUM(Population) OVER w AS 'sum', 
       RANK() over w as 'rank', 
       percent_rank() OVER w as '%rank' 
       from city where District='Ohio' 
       WINDOW w AS (ORDER BY Population);
+------------+--------+---------+------+-------+
| name       | Pop    | sum     | rank | %rank |
+------------+--------+---------+------+-------+
| Dayton     | 166179 |  166179 |    1 |     0 |
| Akron      | 217074 |  383253 |    2 |   0.2 |
| Toledo     | 313619 |  696872 |    3 |   0.4 |
| Cincinnati | 331285 | 1028157 |    4 |   0.6 |
| Cleveland  | 478403 | 1506560 |    5 |   0.8 |
| Columbus   | 711470 | 2218030 |    6 |     1 |
+------------+--------+---------+------+-------+
6 rows in set (0.01 sec)


Ohio is used here as it has just enough data to illustrate some concepts.  Looking at the cities in that state starting with the largest population is done by creating a window that is ordered by Population, biggest first.  Columbus has the largest population and its rank is '1'.

The PERCENTAGE_RANK  is the percentage of scores in the window that are equal or lower to the value in the row, excluding the highest value, for the rank on a range from zero to one for that row.  For a fun exercise , rerun the above query with the desc removed from the window definition (ORDER BY Population).

Differences

The functions first_value and last_value provide a way to see how much small the population is in the other cities than Columbus. To do this the current row, or last_value, is subtracted from the  first value  - Columbus's 711,470 - to provide the answer.

 mysql> select name, 
         Population as 'Pop', 
         SUM(Population) OVER w AS 'sum', 
        (first_value(Population) over w - last_value(Population) over w) as 'diff'  
        from city 
        where District='Ohio' 
        WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+--------+
| name       | Pop    | sum     | diff   |
+------------+--------+---------+--------+
| Columbus   | 711470 |  711470 |      0 |
| Cleveland  | 478403 | 1189873 | 233067 |
| Cincinnati | 331285 | 1521158 | 380185 |
| Toledo     | 313619 | 1834777 | 397851 |
| Akron      | 217074 | 2051851 | 494396 |
| Dayton     | 166179 | 2218030 | 545291 |
+------------+--------+---------+--------+
6 rows in set (0.00 sec)

But how much smaller are cities three to six from the second city, Cleveland.  The NTH_VALUE(expression, n) allows us to subtract the smaller values from the second entry.

mysql> select name, Population as 'Pop', 
       SUM(Population) OVER w AS 'sum', 
   (first_value(Population) over w - last_value(Population) over w) 
            as 'diff Columbus', 
   (nth_value(Population,2) over w - last_value(Population) over w) 
            as 'diff Cleveland'  
      from city where District='Ohio' 
       WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+---------------+----------------+
| name       | Pop    | sum     | diff Columbus | diff Cleveland |
+------------+--------+---------+---------------+----------------+
| Columbus   | 711470 |  711470 |             0 |           NULL |
| Cleveland  | 478403 | 1189873 |        233067 |              0 |
| Cincinnati | 331285 | 1521158 |        380185 |         147118 |
| Toledo     | 313619 | 1834777 |        397851 |         164784 |
| Akron      | 217074 | 2051851 |        494396 |         261329 |
| Dayton     | 166179 | 2218030 |        545291 |         312224 |
+------------+--------+---------+---------------+----------------+
6 rows in set (0.01 sec)

But how much bigger is each city than the next biggest?? For that we need to use LAG(),  LAG() takes the value of column from the row before (So Cleveland's value for LAG is Columbus's Population) and the default lag is 1.  The corresponding LEAD() takes the value from the row after.  So now it is easy to see that Toledo is only 17,666 behind Cincinnati.

mysql> select name, 
       Population as 'Pop', 
       lag(Population,1) over w as 'lag', 
       lead(Population,1) over w as 'lead', 
       Population - lag(Population) over w as 'diff'  
       from city 
       WHERE District = 'Ohio' 
       window w as (ORDER BY Population desc);
+------------+--------+--------+--------+---------+
| name       | Pop    | lag    | lead   | diff    |
+------------+--------+--------+--------+---------+
| Columbus   | 711470 |   NULL | 478403 |    NULL |
| Cleveland  | 478403 | 711470 | 331285 | -233067 |
| Cincinnati | 331285 | 478403 | 313619 | -147118 |
| Toledo     | 313619 | 331285 | 217074 |  -17666 |
| Akron      | 217074 | 313619 | 166179 |  -96545 |
| Dayton     | 166179 | 217074 |   NULL |  -50895 |
+------------+--------+--------+--------+---------+
6 rows in set (0.00 sec)



Dividing Up Into Buckets


Another useful function is the NTILE which divvies up the data into a certain number of groups or buckets.  Be sure to use ORDER BY to ensure the rows you have selected are ordered correctly.  Below the cities are split up into a group with four buckets and a group with three buckets.  

mysql> SELECT name, 
              Population as 'Pop', 
              NTILE(4) over w as 'ntile4', 
              NTILE(3) over w as 'ntile3' 
              from city 
              where District = 'Ohio' 
              Window w as (ORDER BY Population desc);
+------------+--------+--------+--------+
| name       | Pop    | ntile4 | ntile3 |
+------------+--------+--------+--------+
| Columbus   | 711470 |      1 |      1 |
| Cleveland  | 478403 |      1 |      1 |
| Cincinnati | 331285 |      2 |      2 |
| Toledo     | 313619 |      2 |      2 |
| Akron      | 217074 |      3 |      3 |
| Dayton     | 166179 |      4 |      3 |
+------------+--------+--------+--------+
6 rows in set (0.00 sec)







Sunday, July 29, 2018

A Kind Introduction MySQL Windowing Functions Part I


Windowing functions are a critical tool for grouping rows of data that are related to other rows. But they go far beyond the regular aggregate functions found in MySQL 5.7 and earlier. In MySQL 8 you do not have to collapse all the information down into a single output row. Each row can retain its individual identity but the server can analyze the data as a unit.

Statistics and Damned Lies

Finding the total Population of the District Texas from the world.city table is simple. 

SQL> select District, sum(Population)  
from city where district = 'Texas';
+----------+-----------------+
| District | sum(Population) |
+----------+-----------------+
| Texas    |         9208281 |
+----------+-----------------+
1 row in set (0.0068 sec)






 Simple.  But try to expand it to the entire USA and you get problems.

SQL> select District, sum(Population)  
from city where CountryCode = 'USA';
+----------+-----------------+
| District | sum(Population) |
+----------+-----------------+
| New York |        78625774 |
+----------+-----------------+
1 row in set (0.0046 sec)


The results only give out the results for all the cities and lumps them under New York. This is not the desired answer. By the way the only time New York (and New York city in particular) has 78 million people is when they are all on the road in front of me when I am trying to take a taxi to an airport.

With a windowing function it is easy to iterate over a subset of the entire data.  Imagine reading the data through a page with a section cut out to form a window that is just the right size to read only the group of rows desired!

SQL> select name, District, Population, sum(Population) over() as p0, 
     District, sum(population) over( partition by District) as p1
     from city where CountryCode = 'USA' limit 10;
+------------+----------+------------+----------+----------+---------+
| name       | District | Population | p0       | District | p1      |
+------------+----------+------------+----------+----------+---------+
| Birmingham | Alabama  |     242820 | 78625774 | Alabama  |  801519 |
| Montgomery | Alabama  |     201568 | 78625774 | Alabama  |  801519 |
| Mobile     | Alabama  |     198915 | 78625774 | Alabama  |  801519 |
| Huntsville | Alabama  |     158216 | 78625774 | Alabama  |  801519 |
| Anchorage  | Alaska   |     260283 | 78625774 | Alaska   |  260283 |
| Phoenix    | Arizona  |    1321045 | 78625774 | Arizona  | 3178903 |
| Tucson     | Arizona  |     486699 | 78625774 | Arizona  | 3178903 |
| Mesa       | Arizona  |     396375 | 78625774 | Arizona  | 3178903 |
| Glendale   | Arizona  |     218812 | 78625774 | Arizona  | 3178903 |
| Scottsdale | Arizona  |     202705 | 78625774 | Arizona  | 3178903 |
+------------+----------+------------+----------+----------+---------+
10 rows in set (0.0075 sec)


The above query has two windows.  The keyword to notice is OVER().  The window defined is OVER() with nothing within the parenthesis as this should be under stood to mean 'the widow is open wide enough to see all the data'. So sum(Population) over() as p0 will give us the sum of all the Population columns and name the column p0.

The second window is defined as sum(population) over (partition by District) as p1 will provide all of the Population of each district summed in a column named p1.  

Indianapolis 500 winning Lotus driven by Jimmy Clark. Before this car all other winners of the race had the engine in front of the driver.  What does this have to do with Windowing Functions, databases, SQL, or anything else in this blog post. Well, nothing but while at Detroit PHP I stopped by the Henry Ford Museum and took this picture of a very icon car that changed Indy forever.  Windowing Functions can change your SQL skills.

Different Types of Windows

The OVER clause has two forms - window_spec and window_name.  The window_spec option declares the specified window with the parenthesis.  While window_name is a window defined elsewhere in the query.  The send window in the above query  (where the output is named p1) is a window_spec.

So here is a window_name example:

SQL> SELECT District, Sum(Population) OVER w 
    FROM city where CountryCode ='USA' 
    WINDOW w AS (partition by District) limit 10;
+----------+------------------------+
| District | Sum(Population) OVER w |
+----------+------------------------+
| Alabama  |                 801519 |
| Alabama  |                 801519 |
| Alabama  |                 801519 |
| Alabama  |                 801519 |
| Alaska   |                 260283 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
| Arizona  |                3178903 |
+----------+------------------------+
10 rows in set (0.0032 sec)


The window was given the name w and then defined as WINDOW w AS (partition by District).   By the way the declaration within the window_name is itself as window_spec.

So what is a window_spec??

 

The definition from the manual (https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html) informs that a window_spec is defined as:

window_spec: [window_name] [partition_clause] [order_clause] [frame_clause]

The window_name is an alias that can be used elsewhere in the query.

The partition_clause is how the data is to be divided up into groups. If this is unspecified it makes one big group.  

The order_clause provides the sort_order. Remember the ORDER BY from regular SQL queries?  This is how you can order the groups.

And the frame_clause determines sub groups within the big group.

And as it so often happens, there is a lot of material to be covered under the frame_clause and that will be in a future blog.





Monday, July 23, 2018

Advanced MySQL JSON_TABLE

JSON_TABLE is one of the more complex functions that arrived in MySQL 8.  It takes schemaless JSON data and turns it into a relational table. So your NoSQL data becomes SQL data and you can use good ol' SQL where clauses on that temporary relational table! 

I stated writing about JSON_TABLE here and here last December.  And you can find details on using JSON_TABLE in my book.  The following examples cover what to do when key/value pairs are missing or bad, traversing nested paths, and adding an ordinal number to nested values.  These operations provide a great deal of muscle when wrestling NoSQL data into a relational format.

JSON_TABLE

The first argument to JSON_TABLE is the name of the JSON column in the table.  In the following example '$" denotes the entire JSON data type column or JSON document. 

COLUMNS is used to call out the various key/values and assign them to a temporary 'column' and a definition of that column.  In this example the JSON key Population is being called out as an integer to be named Pop.  Oh, if the values are missing for Population, JSON_TABLE will insert the DEFAULT value, here 999, to the results. Or if there is an error in the data, like a string instead of something integer-ish, it output a NULL, a valid JSON string, or the DEFAULT value, which is used in this case.  And if the field is empty (NULL), a valid JSON string, or the DEFAULT value, as you direct -- '987' in the example that follows.

Example 1 -Handling bad or missing data

This example has four records where two of the records need some direction.  The has a bad value ('fish') and an empty value (no Population key in the JSON column). 

mysql> SELECT name, 
          Info->>"$.Population", 
          Pop FROM city2,   
          JSON_TABLE(Info,"$" COLUMNS 
          ( Pop INT PATH "$.Population" 
          DEFAULT '999' 
          ON ERROR DEFAULT 
          '987' ON EMPTY))  
         AS x1;
+-------+-----------------------+------+
| name  | Info->>"$.Population" | Pop  |
+-------+-----------------------+------+
| alpha | 100                   |  100 |
| beta  | fish                  |  999 |
| delta | 15                    |   15 |
| gamma | NULL                  |  987 |
+-------+-----------------------+------+
4 rows in set, 1 warning (0.00 sec)

So 'fish' becomes '999' and NULL becomes '987'.

Example 2

Nested values in JSON data are very common and there is a NESTED PATH operator to allow extracting those values.  The restaurants data set from the MongoDB world has some ratings scores in an array named 'grades' that are nested.

{"_id": "00005b2176ae0000000000000001",
"name": "Morris Park Bake Shop",
"grades": [
{"date": {"$date": 1393804800000}, "grade": "A", "score": 2}, 
{"date": {"$date": 1378857600000}, "grade": "A", "score": 6}, 
{"date": {"$date": 1358985600000}, "grade": "A", "score": 10}, 
{"date": {"$date": 1322006400000}, "grade": "A", "score": 9}, 
{"date": {"$date": 1299715200000}, "grade": "B", "score": 14}],
"address": {"coord": [-73.856077, 40.848447],
"street": "Morris Park Ave",
"zipcode": "10462", "
"cuisine": "Bakery",
"restaurant_id": "30075445"}

The NESTED PATH operator allows access to each of the grades.

mysql> select aaaa.*  from restaurants,  
             json_table(doc, "$" COLUMNS
             (name char(50) path "$.name",   
              style varchar(50) path "$.cuisine",  
             NESTED PATH '$.grades[*]' COLUMNS 
             (Grading char(10) path "$.grade", 
             Score INT path "$.score"))) 
      as aaaa ;
+--------------------------------+------------+---------+-------+
| name                           | style      | Grading | Score |
+--------------------------------+------------+---------+-------+
| Morris Park Bake Shop          | Bakery     | A       |     2 |
| Morris Park Bake Shop          | Bakery     | A       |     6 |
| Morris Park Bake Shop          | Bakery     | A       |    10 |
| Morris Park Bake Shop          | Bakery     | A       |     9 |
| Morris Park Bake Shop          | Bakery     | B       |    14 |
| Wendy'S                        | Hamburgers | A       |     8 |
| Wendy'S                        | Hamburgers | B       |    23 |
| Wendy'S                        | Hamburgers | A       |    12 |
| Wendy'S                        | Hamburgers | A       |    12 |
| Dj Reynolds Pub And Restaurant | Irish      | A       |     2 |
+--------------------------------+------------+---------+-------+
10 rows in set (0.00 sec)

From here the average scores could be computed with a Windowing Function and a Common Table Expression. A big tip of the hat to LeFred for the following

WITH cte1 AS (SELECT doc->>"$.name" AS 'name',
doc->>"$.cuisine" AS 'cuisine',
        (SELECT AVG(score) FROM 
        JSON_TABLE(doc, "$.grades[*]"
        COLUMNS (score INT PATH "$.score")) as r ) AS avg_score
 FROM restaurants)
 SELECT *, rank() OVER 
  (PARTITION BY cuisine ORDER BY avg_score) AS `rank`
  FROM cte1 
  ORDER by `rank`, avg_score DESC limit 10;


Example 3

Not all the restaurant have the same number of grades. But it is very easy to see the number of grades. We can get ordinal numbers next to each grade with  FOR ORDINALITY


mysql> select aaaa.name, aaaa.ordinal, aaaa.Grading  
       FROM restaurants,  json_table(doc, "$" COLUMNS(
         name char(50) path "$.name",   
         style varchar(50) path "$.cuisine",  
         NESTED PATH '$.grades[*]'    COLUMNS (
           ordinal FOR ORDINALITY, 
           Grading char(10) path "$.grade", 
           Score INT path "$.score"))) 
        as aaaa  limit 10;
+--------------------------------+---------+---------+
| name                           | ordinal | Grading |
+--------------------------------+---------+---------+
| Morris Park Bake Shop          |       1 | A       |
| Morris Park Bake Shop          |       2 | A       |
| Morris Park Bake Shop          |       3 | A       |
| Morris Park Bake Shop          |       4 | A       |
| Morris Park Bake Shop          |       5 | B       |
| Wendy'S                        |       1 | A       |
| Wendy'S                        |       2 | B       |
| Wendy'S                        |       3 | A       |
| Wendy'S                        |       4 | A       |
| Dj Reynolds Pub And Restaurant |       1 | A       |
+--------------------------------+---------+---------+
10 rows in set (0.00 sec)

Conclusion

JSON_TABLE provides a way to structure and manipulate unstructured, NoSQL data.  It is a powerful way to present data at the database level that might hamper an application.  Plus it provides a best of both the SQL and NoSQL worlds that leverages the best of MySQL.


Friday, July 20, 2018

De-Normalization of Your Database with JSON

One of the humbling things about working at Oracle with the various MySQL personnel is that you are often blown away by something one of them says or does.  And that is on a regular basis.  In this case it is Dr. Charles Bell who gave a great series of presentations last June at the Southeast Linuxfest.

In particular he presented in a full formed state some ideas that had been rattling around in my  skull (but no way near as coherent) on how to take advantage of the MySQL JSON data type.  Below are his points from his slide deck.  I was reviewing my notes from his presentation when I realized that this information really needs to be more widely disseminated.   And I would like your feedback on these ideas?

1.. We can use a JSON field to eliminate one of the issues of traditional database solutions: many-to-many-joins
  • This allows more freedom to store unstructured data (data with pieces missing)
  • You still use SQL to work with the data via a database connector but the JSON documents in the table can be manipulated directly in code.
  • Joins can be expensive. Reducing how many places you need to join data can help speed up your queries.  Removing joins may result in some level of denormalization but can result in fast access to the data.
2. Plan For Mutability
  • Schemaless designs are focused on mutability. Build your applications with the ability to modify the document as needed (and within reason)
3. Remove Many-to-Many Relationships

  • Use embedded arrays and lists to store relationships among documents.  This can be as simple as embedding the data in the document or embedding an array of document ids in the document.
  • In the first case data is available as soon as you can read the document and in the second it only takes one additional step to retrieve the data. In cases of seldom read (used) relationships, having the data linked with an array of ids can be more efficient (less data to read on the first pass)
This presentation and others from the Southeast Linuxfest will be available online and I will be sure to post about that when it happens.

And a big thank you to Dr. Chuck for these ideas.

Friday, July 6, 2018

Finding Values with JSON_CONTAINS

There was an interesting but hard to read post on StackOverflow about how 'insert select delete' data from a MySQL JSON data type column.  The first line of the writer's problem is a little confusing '
In order to record user mac_address and count mac_address to restrict user login's pc or notebook to control user available max for example (it's work)' but the examples reveled more about what was desired.
The idea was to track MAC address used by various users and the author of the question was wondering how to up data a JSON Array of values with JSON_INSERT.  INSERT is for inserting and the better choice would be JSON_ARRAY_APPEND or JSON_ARRAY_INSERT.   
But what caught my eye was the second question: Select sql command for json column ? could be example? If I want to check whether mac value exists 'c84wr8492eda' 
Well, here comes a shameless plug for my Book MySQL and JSON - A Practical Programming Guide  as it details how to do this sort of thing.  What is desired is a certain value (c84wr8492eda) and we can find that easily enough.  We know the key that needs to be searched (mac) and the desired MAC address. 
MySQL> select 
json_contains(auth_list,json_quote('c84wr8492eda'),'$.mac') 
from users;
+-------------------------------------------------------------+
| json_contains(auth_list,json_quote('c84wr8492eda'),'$.mac') |
+-------------------------------------------------------------+
|                                                           1 |
+-------------------------------------------------------------+

A better code snipped would be SELECT id FROM users WHERE JSON_CONTAINS(auth_list,JSON_QUOTE('c84wr8492eda'),'$,mac') = 1; as you will probably be acting on the  'id' field with the matching MAC address.   


You can find answers to problems like this in my hands book available from Amazon.com and other book sellers.
The third question 'Delete sql command for json column ? Could be example? if I want to delete a item where mac value is 'c84wr8492eda'' was also unclear.  Delete the entire record or delete the MAC address from the JSON column?   Ah, the joys of StackOverflow.
DELETE FROM users WHERE JSON_CONTAINS(auth_list,JSON_QUOTE('c84wr8492eda'),'$,mac') = 1;  would remove the entire row.  But what about pruning the one item out of the array?  Well there is not a JSON_ARRAY_REMOVE_ONE_ITEM function.    I would want to get the values for auth_list into a string, removing the desired MAC address, and then using JSON_SET to rewrite the column.  But if you have other options, please let me know!