Friday, October 12, 2018

MySQL 2018 Community Reception

The 2018 MySQL Community Reception is October 23rd in a new venue, Samovar Tea, 730 Howard Street in San Francisco, at 7:00 PM, right in the heart of the Moscone Center activities for Oracle OpenWorld and Oracle Code One.

The MySQL Community Reception is not part of Oracle OpenWorld or Oracle Code One (you do not need a badge for either event) but you do need to RSVP.  Food, drinks, and a really amazing group of attendees!   And there will be more than tea to drink.

Plus we have a few new surprises this year! 

Wednesday, October 10, 2018

Two More MySQL Books for 2018

Last time I mentioned four great MySQL books for 2018.  I was tactfully reminded of two books I overlooked. First is Dr. Charles Bell's Introducing InnoDB Cluster which I have not read (but it is on order).
Introducing InnoDB Cluster
And last, but not least, is Mikael Ronström's MySQL Cluster 7.5 Inside and Out.  This is another book on NDB Cluster and is a 'must have' for those running NDB clusters.

MySQL Cluster 7.5 Inside and Out
I apologize to both authors and take full blame for not mentioning these two fine books.  Now I just have to wait for Amazon to send me the copies I ordered!


Tuesday, October 9, 2018

MySQL Books - 2018 has been a very good year

Someone once told me you can tell how healthy a software project is by the number of new books each year.  For the past few years the MySQL community has been blessed with one or two books each year. Part of that was the major shift to MySQL 8, but part of it was that the vast majority of the changes were fairly minor and did not need detailed explanations. But this year we have been blessed with four new books.  Four very good books on new facets of MySQL.

Introducing the MySQL 8 Document Store is the latest book from Dr. Charles Bell on MySQL.  If you have read any of Dr. Chuck's other books, you know they are well written with lots of examples.  This is more than a simple introduction, with many intermediate and advanced concepts covered in detail.

Introducing the MySQL 8 Document Store
MySQL & JSON - A Practical Programming Guide by yours truly is a guide for developers who want to get the most out of the JSON data type introduced in MySQL 5.7 and improved in MySQL 8.  While I love MySQL's documentation, I wanted to provide detailed examples of how to use the various functions and features of the JSON data type. 

MySQL and JSON A Practical Programming Guide

Jesper Wisborg Krogh is a busy man at work and somehow found the time to author and co-author two books.  The newest is MySQL Connector/Python Revealed: SQL and NoSQL Data Storage Using MySQL for Python Programmers which I have only just received.  If you are a Python Programmer (or want to be) then you need to order your copy today.  A few chapters in and I am already finding it a great, informative read.
MySQL Connector/Python Revealed

Jesper and Mikiya Okuno produced a definitive guide to the MySQL NDB cluster with Pro MySQL NDB Cluster.  NDB cluster is often confusing and just different enough from 'regular' MySQL to make you want to have a clear, concise guidebook by your side.  And this is that book.

Pro MySQL NDB Cluster

Recommendation

Each of these books has its own primary MySQL niche (Docstore, JSON, Python & Docstore, and NDB Cluster) but also has deeper breadth in that it covers material you either will not find in the documentation or would have to distill for yourself.  They not only provide valuable tools for learning their primary facets of the technology but also do double service as reference guides. 



Monday, October 1, 2018

Prepared Statements for MySQL: PDO, MySQLi, and X DevAPI

Recently I ran across a prominent PHP developer who incorrectly claimed that only PDO allows binding values to variables for prepared statements.  A lot of developers use prepared statements to reduce the potential for SQL injection, and that is a good first step.  But there are some features that you may not know about.

What is a Prepared Statement?


The MySQL Manual states: "The MySQL database supports prepared statements. A prepared statement or a parameterized statement is used to execute the same statement repeatedly with high efficiency."

So far, so good. And there is also a performance benefit to consider.  From the same source: "The prepared statement execution consists of two stages: prepare and execute. At the prepare stage a statement template is sent to the database server. The server performs a syntax check and initializes server internal resources for later use."

So it is a two-step process: set up the query as a template and then plug in the value. If you need to reuse the query, just plug a new value into the template.
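
The same two-step flow is visible at the SQL level with MySQL's own PREPARE/EXECUTE syntax. A minimal sketch (the fruit table is borrowed from the PDO example below):

-- Stage one: the template is sent to the server and parsed once
PREPARE fruit_stmt FROM
  'SELECT name, colour, calories FROM fruit
   WHERE calories < ? AND colour = ?';

-- Stage two: plug values into the template and execute
SET @cal = 150, @colour = 'red';
EXECUTE fruit_stmt USING @cal, @colour;

-- Reuse the same template with new values
SET @cal = 175, @colour = 'yellow';
EXECUTE fruit_stmt USING @cal, @colour;

DEALLOCATE PREPARE fruit_stmt;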

So let's look at how this is done in PHP.

PDO


On PHP.NET, there are a lot of really great examples. Question marks (?) are used as placeholders that will be filled in at execution time. 

$sth = $dbh->prepare('SELECT name, colour, calories
    FROM fruit
    WHERE calories < ? AND colour = ?');
$sth->execute(array(150, 'red'));
$red = $sth->fetchAll();
$sth->execute(array(175, 'yellow'));
$yellow = $sth->fetchAll();

So those are the basics. But what do prepared statements look like with the other two extensions?


MySQLi


So what does the MySQLi version look like? Once again question marks are used as placeholders.

$stmt = $mysqli->prepare("INSERT INTO CountryLanguage VALUES (?, ?, ?, ?)");
$stmt->bind_param('sssd', $code, $language, $official, $percent);

$code = 'DEU';
$language = 'Bavarian';
$official = "F";
$percent = 11.2;

/* execute prepared statement */
$stmt->execute();
printf("%d Row inserted.\n", $stmt->affected_rows);

But what is that sssd stuff?  That is where you declare the types of the variables you want to bind.  Use 's' for string, 'i' for integer, 'd' for double, and 'b' for a blob (binary large object).  So you get the advantage of type checking.

X DevAPI

The much newer X DevAPI is for the new X Protocol and the MySQL Document Store.  Unlike the other two examples, it is not Structured Query Language (SQL) based.


$res = $coll->modify('name like :name')
            ->arrayInsert('job[0]', 'Calciatore')
            ->bind(['name' => 'ENTITY'])
            ->execute();

$res = $table->delete()
             ->orderby('age desc')
             ->where('age < 20 and age > 12 and name != :name')
             ->bind(['name' => 'Tierney'])
             ->limit(2)
             ->execute();

Note that this is not an object relational mapper; the X DevAPI is the protocol itself, not something mapping objects onto SQL.

Wrap Up

So now you know how to use prepared statements with all three PHP MySQL Extensions.


Thursday, September 27, 2018

MySQL Track at Southern California Linux Expo 2019 CFP Open

The Call For Papers for the next Southern California Linux Expo, aka SCaLE, is open and I need your help with the MySQL Track.  We have had a MySQL track for the past few years, and this year I have gotten permission from the organizers of SCaLE to have a group of MySQL community members review the talks for this track.  This year was pretty good, but to make 2019 even better we need more submissions from more people AND THIS MEANS YOU!!

The link above has the details on how to register and submit a talk, and the process is fairly simple.  But if you would like help with your submission, want to 'rubber duck' ideas, or want a quick review before you submit, please contact me (@stoker, david.stokes @ Oracle.com) or find me at a show.

So what types of talks do we need?  We need to cover material for newbies, intermediates, and the experienced.  There are lots of developers, DevOps folks, hobbyists, and students just waiting for you to share your knowledge with them. For a large show, we have a fairly intimate talk space which is friendly to first-time presenters.

Some ideas:

1. So how do you do backups and restores?  What sort of things get deleted and what steps do you take to keep your data safe? Policies? Procedures?  How does a developer who fudged a row/table get their data back in your organization?

2. How do you reduce SQL injection, the N+1 problem, and O(N) searches in your code?

3. How is your Kubernetes/Docker/Ansible environment for your developers managed?

4. Five things about being an 'accidental' MySQL DBA I wish I knew when I took over the databases

5. MySQL Replication best practices.

6. Data normalization how-to

7. Query optimization for novices. Query optimization for the advanced SQL coder.

And anything else  you think that at least one other person would like to hear about (and there are more - believe me).

So please make SCaLE 17x even better! You have until the end of October to submit.

Thursday, September 6, 2018

JSON Paths and the MySQL JSON Functions

I wrote MySQL and JSON: A Practical Programming Guide to help developers find their way around the MySQL JSON data type and the supporting functions. The MySQL Documentation on the subject is very good but I had to puzzle through the examples to see how things worked.  I might be a bit 'thick' but good examples always make things easier.  Others seem to have similar difficulties.
MySQL and JSON a Practical Programming Guide should be on your desk as a handy reference to MySQL's JSON data type.

 There was a recent post on Stackoverflow.com where someone had this JSON document:

{  
   "textures":[  
      {  
         "label":"test",
         "types":{  
            "t_1":0,
            "t_2":0
         }
      },
      {  
         "label":"KEK",
         "types":{  
            "t_1":0,
            "t_2":0
         }
      }
   ],
   "weapons":[  
      {  
         "name":"WW_SHT",
         "ammo":0
      },
      {  
         "name":"WW_DSS",
         "ammo":0
      }
   ]
}

And they wanted to update t_1 to change its value from 0 to 1.  I will not repost their code, but to my eyes it looked convoluted. 

So How Do You Get There From Here?
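
If you want to follow along, here is a minimal sketch of a table holding that document (the table name zz1 and column name doc come from the queries below; the setup itself is my assumption):

CREATE TABLE zz1 (doc JSON);

INSERT INTO zz1 VALUES ('{
  "textures":[
    {"label":"test", "types":{"t_1":0, "t_2":0}},
    {"label":"KEK",  "types":{"t_1":0, "t_2":0}}
  ],
  "weapons":[
    {"name":"WW_SHT", "ammo":0},
    {"name":"WW_DSS", "ammo":0}
  ]
}');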

Getting down to a key or value starts off easy.  To see the top-level keys, simply use JSON_KEYS():

SELECT JSON_KEYS(doc) FROM zz1 LIMIT 1;

'[\"weapons\", \"textures\"]'

But how do we get deeper?

By using select doc->>"$.textures[*]" from zz1 limit 1; we get all the info under the textures key.

[{"label": "test", "types": {"t_1": 0, "t_2": 0}}, {"label": "KEK", "types": {"t_1": 0, "t_2": 0}}]

Okay so we are getting closer to the target!    Now to take one more step closer with select doc->"$.textures[*].types" from zz1 limit 1;

[{"t_1": 0, "t_2": 0}, {"t_1": 0, "t_2": 0}]

I like to use JSON_PRETTY to get an easier-to-read view:

select json_pretty(doc->"$.textures[*].types")  
from zz1 limit 1;
[
  {
    "t_1": 0,
    "t_2": 0
  },
  {
    "t_1": 0,
    "t_2": 0
  }
]

But there are two t_1s!

The next step is to get just those t_1 values, and that is done with select doc->"$.textures[*].types.t_1" from zz1;

Which gives us:

 [0, 0]

Not really confidence inspiring, eh? So let's change one of those zeros to a nine.

update zz1 set doc = json_set(doc,"$.textures[0].types.t_1",9);
 
So did we change the first or the second t_1?

select json_pretty(doc->"$.textures[*].types")  
from zz1;
[
  {
    "t_1": 9,
    "t_2": 0
  },
  {
    "t_1": 0,
    "t_2": 0
  }
]

But let's double-check and change the second t_1 also. 

update zz1 set doc = json_set(doc,"$.textures[1].types.t_1",7) ;

Hopefully that second one will end up with a value of seven.

select json_pretty(doc->"$.textures[*].types") from zz1;
[
  {
    "t_1": 9,
    "t_2": 0
  },
  {
    "t_1": 7,
    "t_2": 0
  }
]

So now we can get to the exact values we want.
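
As an aside, JSON_SET() accepts multiple path/value pairs, so the original poster's change (flipping both t_1 values from 0 to 1) can be made in one statement. A sketch:

update zz1 set doc = json_set(doc,
                              "$.textures[0].types.t_1", 1,
                              "$.textures[1].types.t_1", 1);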

Annotated JSON Document

So let's look at the textures section of the JSON document with the paths annotated as comments.

"textures":[  
      {  -- textures[0]
         "label":"test",  
         "types":{  -- textures[0].types
            "t_1":0,--textures[0].types.t_1
            "t_2":0
         }
      },
      {  -- textures[1]
         "label":"KEK",
         "types":{  
            "t_1":0,--textures[1].types.t_1
            "t_2":0
         }
      }
   ]
Hopefully this will ease someone's confusion down the line.  And please do buy my book.

Wednesday, September 5, 2018

Loading Unknown CSV Data into a Table Easily

Peter Zaitsev of Percona posted a Tweet about an interactive book, Select Star SQL, which 'aims to be the best place on the internet for learning SQL'.  This interactive book uses a data set that can be downloaded, or you can work the queries online.  I showed the website to a neighbor who was asking 'what is the best way to learn SQL'.   The neighbor wanted their own copy of the data and asked how to turn a raw CSV file into a table.

Well, that can get complicated.

CSV


CSV files use commas to designate each field in the file.  This is a great idea until someone plants a comma in the middle of a field, but it is sort of a least-common-denominator way of passing data between systems.

But a CSV file does not directly fit into a relational table.  You could shove all the data into a JSON column but that is not what we want.   What is needed is a peek at the data in the CSV file to see what the various fields are and possible hints at the types of data therein.  So download the data and please follow along.

Traditionally it took a bit of data sleuthing, examining the CSV file to determine how the data needed to be handled to fit into a relational table, even before you fired up your database client. 

If you open the file with a spreadsheet program like LibreOffice Calc, it will usually make a best effort at putting each field in its own column.  Luckily this data set has a first row that names the columns. And here are the column headings:

Execution
Date of Birth
Date of Offence
Highest Education Level
Last Name
First Name
TDCJ Number
Age at Execution
Date Received
Execution Date
Race
County
Eye Color
Weight
Height
Native County
Native State
Last Statement

Looking at the first few rows under the headings, we can get a good glimpse of the data types for the columns.  What we want to do on the first pass is see whether values are strings (CHAR), integers, reals, and the like.  It also helps to look for some field that can be used for a primary key, hopefully a column with unique values.

Execution  INT Primary key??  Unique??
Date of Birth DATE
Date of Offence DATE
Highest Education Level INT
Last Name CHAR
First Name CHAR
TDCJ Number INT
Age at Execution INT
Date Received   DATE
Execution Date DATE
Race CHAR
County CHAR
Eye Color CHAR
Weight INT
Height CHAR
Native County  CHAR
Native State CHAR
Last Statement  LONG TEXT

Second Pass


In this case it looks like we could use Execution as the primary key.  Now we need to go through the data to see which fields need special handling.  Are the possible INT fields all positive values (unsigned)? Could we use ENUMs anywhere (maybe Native County, Native State, or Eye Color)? 

Also, for metric fans, the Height column uses imperial measurements, which may not fit a company policy of metric only (or cubits). Plus Height has two components -- feet and inches -- that we may need to break down later, or we may want to record all this data in inches. We will not cover that in this blog post (but we could if enough folks ask how to do this).

Also we need to examine each of the CHAR fields to check their maximum lengths.  This can be dangerous for future additions, where suddenly the maximum length set by our first pass at the data is swamped by some much longer field.  If the only last names we see are Smith and Jones, setting the maximum length of the CHAR field works until we run into someone named Buchanan or Schwarzenegger.    But since in this case we have a closed system (no updates to the data), we will not worry about this, at least for this data set.
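
Once the data is loaded, guesses like these can be double-checked with a quick query (a sketch, using the tx_deathrow table name that appears later in this post):

SELECT MAX(CHAR_LENGTH(`Last Name`)) AS longest_last_name,
       MAX(CHAR_LENGTH(`County`))    AS longest_county
FROM tx_deathrow;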

So making a best guess, let's assign lengths to the CHAR fields.

Execution  INT Primary key??  Unique??
Date of Birth DATE
Date of Offence DATE
Highest Education Level INT  unsigned
Last Name CHAR  25
First Name CHAR  25
TDCJ Number INT
Age at Execution INT
Date Received   DATE
Execution Date DATE
Race CHAR  10
County CHAR 20
Eye Color CHAR 10
Weight INT
Height CHAR 10
Native County  CHAR 25
Native State CHAR 25
Last Statement  LONG TEXT

Please note that I will actually be using VARCHAR instead of CHAR; CHAR is used above just for clarity.

Create Our Table

We could of course create a table like the one below and then load the data file.
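
Here is a sketch of what that could look like. The column definitions follow the second pass above; the CSV file name, the LOAD DATA options, and the assumption that the dates are in an ISO-friendly format are all mine (real-world date columns may need STR_TO_DATE() massaging through user variables):

CREATE TABLE tx_deathrow (
  `Execution`               INT UNSIGNED PRIMARY KEY,
  `Date of Birth`           DATE,
  `Date of Offence`         DATE,
  `Highest Education Level` INT UNSIGNED,
  `Last Name`               VARCHAR(25),
  `First Name`              VARCHAR(25),
  `TDCJ Number`             INT,
  `Age at Execution`        INT,
  `Date Received`           DATE,
  `Execution Date`          DATE,
  `Race`                    VARCHAR(10),
  `County`                  VARCHAR(20),
  `Eye Color`               VARCHAR(10),
  `Weight`                  INT,
  `Height`                  VARCHAR(10),
  `Native County`           VARCHAR(25),
  `Native State`            VARCHAR(25),
  `Last Statement`          LONGTEXT
);

-- Skip the header row; the file name is an assumption
LOAD DATA LOCAL INFILE 'tx_deathrow.csv'
INTO TABLE tx_deathrow
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;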



MySQL Workbench makes quick work of creating the schema.  But Workbench also has a wizard to open the CSV, let us name the columns, and import the data.

The Easy Way To Go From CSV to Relational Table

MySQL Workbench's import wizard is reached by clicking on a table icon and selecting Table Data Import Wizard.

Tell the import wizard where the CSV is located


Start the table import wizard

The CSV is split up by fields and we could edit field types 
And the data is loaded



Now I have a big warning: the column names are copied from the first row of the CSV file.  We could have changed Date of Birth to something like Birth_Date but chose not to. This means we have to write queries like SELECT `Date of Birth` FROM tx_deathrow.

But now we have a local copy of the data to use when working through the interactive book.


Wednesday, August 1, 2018

A Nice Introduction to MySQL Window Functions III

Windowing functions can get quite complex very quickly when you start taking advantage of the frame clause. Ranges and rows can get confusing.  So for review, let's look at the specification:

Window_spec:
   [window name] [partition clause] [order clause] [frame clause]

That looks simple. And then come terms like UNBOUNDED PRECEDING that could put a knot in your gut.  The manual is not exactly written to help novices in this area get up to speed.  But don't panic.  If you work through the examples that follow (and please do the preceding parts of this series before trying these examples) you will have a better appreciation of what is going on with window functions.

The Frame Clause


The frame clause is optional in the window function.  A frame is a subset of the current partition, and the frame clause defines that subset.  Frames are determined with respect to the current row, which allows grouping of data within a partition depending on the current row.  If the frame is defined as all rows from the start to the end of the partition, you can compute running totals for each row.  Or the frame can be defined as extending a desired number of rows on either side of the current row, which lets you compute rolling averages.

The first example in this series had the window defined with an empty OVER(), which has no frame clause and computes over the entire column.
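
To follow along, the table is trivial to re-create; a minimal sketch matching the description of table x from the earlier posts in this series:

CREATE TABLE x (x INT);
INSERT INTO x VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);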

mysql> select x, 
              sum(x) over() from x;
+------+---------------+
| x    | sum(x) over() |
+------+---------------+
|    1 |            55 |
|    2 |            55 |
|    3 |            55 |
|    4 |            55 |
|    5 |            55 |
|    6 |            55 |
|    7 |            55 |
|    8 |            55 |
|    9 |            55 |
|   10 |            55 |
+------+---------------+
10 rows in set (0.00 sec)

Unbounded

There is usually more than one way to do things. OVER() can also be written as WINDOW w AS (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) or WINDOW w AS (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).

Uh, wha, huh?

Besides the confusion of windows/frames and range/rows, the wording gets tricky.   UNBOUNDED should be taken to mean 'everything', so UNBOUNDED PRECEDING means everything before the current row and UNBOUNDED FOLLOWING means everything after it.



mysql> SELECT x, 
             sum(x) over() as 'over', 
             sum(x) OVER a as 'row',
             sum(x) OVER b AS 'range' 
             FROM x 
             window 
               a AS (rows between unbounded preceding and unbounded following),
               b AS (range between unbounded preceding and unbounded following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |   55 |    55 |
|    2 |   55 |   55 |    55 |
|    3 |   55 |   55 |    55 |
|    4 |   55 |   55 |    55 |
|    5 |   55 |   55 |    55 |
|    6 |   55 |   55 |    55 |
|    7 |   55 |   55 |    55 |
|    8 |   55 |   55 |    55 |
|    9 |   55 |   55 |    55 |
|   10 |   55 |   55 |    55 |
+------+------+------+-------+
10 rows in set (0.01 sec)

So in the above example the windows are framed so that all the rows are used for the summation of the column, whether using rows or range.

ROW vs RANGE

Right now you are probably wondering what the difference is between RANGE and ROWS.  Let's modify the last example slightly, making between current row and unbounded following the core of the frame specification.

mysql> SELECT x, 
         sum(x) over() as 'over', 
         sum(x) OVER a as 'row', 
         sum(x) OVER b AS 'range' 
         FROM x 
  window a AS (rows between current row and unbounded following), 
         b AS (range between current row and unbounded following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |   55 |    55 |
|    2 |   55 |   54 |    55 |
|    3 |   55 |   52 |    55 |
|    4 |   55 |   49 |    55 |
|    5 |   55 |   45 |    55 |
|    6 |   55 |   40 |    55 |
|    7 |   55 |   34 |    55 |
|    8 |   55 |   27 |    55 |
|    9 |   55 |   19 |    55 |
|   10 |   55 |   10 |    55 |
+------+------+------+-------+
10 rows in set (0.00 sec)

The row column now counts down while the range column is unchanged.  What happened? Rows frames are defined by beginning and ending row positions.  The first row has a sum of all ten items in the column, but the second row has a sum starting with the value of 2 and ending at 10, the third row sums 3 to 10, et cetera.

For range, the frame is defined by rows falling within a value range of an ORDER BY column.  But where is the ORDER BY?  Well, in this case it is implied to be the entire partition, so it takes in all the column values.

Range gets a little more differentiated when you use it a little more creatively.

mysql> SELECT x,
             sum(x) over() as 'over', 
             sum(x) OVER a as 'row', 
             sum(x) OVER b AS 'range' 
             FROM x 
             window 
               a AS (rows between current row and 1 following), 
               b AS (order by x 
                     range between current row and 1 following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |    3 |     3 |
|    2 |   55 |    5 |     5 |
|    3 |   55 |    7 |     7 |
|    4 |   55 |    9 |     9 |
|    5 |   55 |   11 |    11 |
|    6 |   55 |   13 |    13 |
|    7 |   55 |   15 |    15 |
|    8 |   55 |   17 |    17 |
|    9 |   55 |   19 |    19 |
|   10 |   55 |   10 |    10 |
+------+------+------+-------+
10 rows in set (0.06 sec)

The above will sum the current row and the next row.  With row the frame becomes rows between current row and 1 following.  But this time for range the order has to be specified and the frame becomes  order by x range between current row and 1 following. 

It is possible to specify bigger ranges.  

mysql> SELECT x, 
          sum(x) over() as 'over', 
           sum(x) OVER a as 'row', 
           sum(x) OVER b AS 'range' 
          FROM x 
      window a AS (rows between 1 preceding and 3 following), 
        b AS (order by x range between 1 preceding and 3 following);
+------+------+------+-------+
| x    | over | row  | range |
+------+------+------+-------+
|    1 |   55 |   10 |    10 |
|    2 |   55 |   15 |    15 |
|    3 |   55 |   20 |    20 |
|    4 |   55 |   25 |    25 |
|    5 |   55 |   30 |    30 |
|    6 |   55 |   35 |    35 |
|    7 |   55 |   40 |    40 |
|    8 |   55 |   34 |    34 |
|    9 |   55 |   27 |    27 |
|   10 |   55 |   19 |    19 |
+------+------+------+-------+
10 rows in set (0.00 sec)


Hopefully this takes a little bit of the mystery out of windowing functions for novices.

Next time I hope to look into some time series data and windowing functions.

Tuesday, July 31, 2018

A Kind Introduction to MySQL Windowing Functions II

Before I pick up where the last blog left off, I need to introduce RANGE and ROWS.  Windows over data can be framed, and this is where things can get wild and woolly.   Table x has a column named x (me being overly creative again) that has the values one through ten.  If we sum the values of x, we can get different values depending on how the frame is constructed.

If the frame for the window is defined as the range between 'unbounded preceding and current row', the value for the sum of x will be the sum for the entire column of data.  However, if the frame is defined as the rows between 'unbounded preceding and current row', it will sum the value of the current row and the values of the rows that came before it: 1, 1+2, 1+2+3, etc.


mysql> SELECT x, sum(x) over w as 'rows',
    -> sum(x) over y as 'range'
    -> from x
    -> window w as 
           (rows between unbounded preceding and current row),
    ->        y as 
           (range between unbounded preceding and current row);
+------+------+-------+
| x    | rows | range |
+------+------+-------+
|    1 |    1 |    55 |
|    2 |    3 |    55 |
|    3 |    6 |    55 |
|    4 |   10 |    55 |
|    5 |   15 |    55 |
|    6 |   21 |    55 |
|    7 |   28 |    55 |
|    8 |   36 |    55 |
|    9 |   45 |    55 |
|   10 |   55 |    55 |
+------+------+-------+
10 rows in set (0.00 sec)

The terminology of frames and windows gets a little confusing and sounds like a home remodeling project.  For now, consider 'unbounded' to be everything in the column, so unbounded preceding would be everything in the column before this row. 


mysql> SELECT x, 
       sum(x) over w as 'rows', 
       sum(x) over y as 'range' 
       from x 
       window w as 
         (rows between current row and unbounded following), 
         y as (range between unbounded preceding and current row);
+------+------+-------+
| x    | rows | range |
+------+------+-------+
|    1 |   55 |    55 |
|    2 |   54 |    55 |
|    3 |   52 |    55 |
|    4 |   49 |    55 |
|    5 |   45 |    55 |
|    6 |   40 |    55 |
|    7 |   34 |    55 |
|    8 |   27 |    55 |
|    9 |   19 |    55 |
|   10 |   10 |    55 |
+------+------+-------+
10 rows in set (0.00 sec)

So unbounded following would take all the values in the column after the current row.  So the rows value of 54 where x equals 2 is the grand total of 55 less the value of the first row (and where x equals 3 it is 55 less 1 and 2).

More frames and windows next time!



More State Stuff


Last time was the start of a series on windowing functions introduced with MySQL 8. This entry will delve a little deeper into this realm.  The data used for these examples starts off with the World database.

Population By State


Previously the data showed a total population for each state but listed an entry for each city in that state, er, district.  It is quite easy to get down to one line per state by adding a simple GROUP BY to the query.


mysql> SELECT district, Sum(Population) OVER w 
FROM city 
WHERE CountryCode = 'USA' 
GROUP BY District  
WINDOW w AS (PARTITION BY District) 
LIMIT 10;
+----------------------+------------------------+
| district             | Sum(Population) OVER w |
+----------------------+------------------------+
| Alabama              |                 242820 |
| Alaska               |                 260283 |
| Arizona              |                1321045 |
| Arkansas             |                 183133 |
| California           |                3694820 |
| Colorado             |                 554636 |
| Connecticut          |                 139529 |
| District of Columbia |                 572059 |
| Florida              |                 735167 |
| Georgia              |                 416474 |
+----------------------+------------------------+
10 rows in set (0.00 sec)


Going Back to Ohio


But what if we wanted a little more statistical information on a state's population?  Well, there are several functions just waiting for you.

mysql> select name, Population as 'Pop', 
       SUM(Population) OVER w AS 'sum', 
       RANK() over w as 'rank', 
       percent_rank() OVER w as '%rank' 
       from city where District='Ohio' 
       WINDOW w AS (ORDER BY Population);
+------------+--------+---------+------+-------+
| name       | Pop    | sum     | rank | %rank |
+------------+--------+---------+------+-------+
| Dayton     | 166179 |  166179 |    1 |     0 |
| Akron      | 217074 |  383253 |    2 |   0.2 |
| Toledo     | 313619 |  696872 |    3 |   0.4 |
| Cincinnati | 331285 | 1028157 |    4 |   0.6 |
| Cleveland  | 478403 | 1506560 |    5 |   0.8 |
| Columbus   | 711470 | 2218030 |    6 |     1 |
+------------+--------+---------+------+-------+
6 rows in set (0.01 sec)


Ohio is used here as it has just enough data to illustrate some concepts.  The window is ordered by Population, smallest first, so Dayton, with the smallest population, has a rank of '1', and Columbus, the largest, has a rank of '6'.

PERCENT_RANK() reports, on a range from zero to one, the fraction of the other values in the window that are lower than the current row's value; it works out to (rank - 1) / (rows - 1), so Cleveland's %rank is (5 - 1) / (6 - 1) = 0.8.  For a fun exercise, rerun the above query with desc added to the window definition (ORDER BY Population desc).

Differences

The functions first_value and last_value provide a way to see how much smaller the populations of the other cities are than Columbus's. To do this, the current row's population, the last_value, is subtracted from the first value, Columbus's 711,470, to provide the answer.

 mysql> select name, 
         Population as 'Pop', 
         SUM(Population) OVER w AS 'sum', 
        (first_value(Population) over w - last_value(Population) over w) as 'diff'  
        from city 
        where District='Ohio' 
        WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+--------+
| name       | Pop    | sum     | diff   |
+------------+--------+---------+--------+
| Columbus   | 711470 |  711470 |      0 |
| Cleveland  | 478403 | 1189873 | 233067 |
| Cincinnati | 331285 | 1521158 | 380185 |
| Toledo     | 313619 | 1834777 | 397851 |
| Akron      | 217074 | 2051851 | 494396 |
| Dayton     | 166179 | 2218030 | 545291 |
+------------+--------+---------+--------+
6 rows in set (0.00 sec)

But how much smaller are cities three to six than the second city, Cleveland?  NTH_VALUE(expression, n) allows us to subtract the smaller values from the second entry.

mysql> select name, Population as 'Pop', 
       SUM(Population) OVER w AS 'sum', 
   (first_value(Population) over w - last_value(Population) over w) 
            as 'diff Columbus', 
   (nth_value(Population,2) over w - last_value(Population) over w) 
            as 'diff Cleveland'  
      from city where District='Ohio' 
       WINDOW w AS (ORDER BY Population desc) ;
+------------+--------+---------+---------------+----------------+
| name       | Pop    | sum     | diff Columbus | diff Cleveland |
+------------+--------+---------+---------------+----------------+
| Columbus   | 711470 |  711470 |             0 |           NULL |
| Cleveland  | 478403 | 1189873 |        233067 |              0 |
| Cincinnati | 331285 | 1521158 |        380185 |         147118 |
| Toledo     | 313619 | 1834777 |        397851 |         164784 |
| Akron      | 217074 | 2051851 |        494396 |         261329 |
| Dayton     | 166179 | 2218030 |        545291 |         312224 |
+------------+--------+---------+---------------+----------------+
6 rows in set (0.01 sec)

But how much smaller is each city than the next biggest?  For that we need LAG().  LAG() takes the value of a column from the row before (so Cleveland's value for LAG is Columbus's population); the default lag is 1.  The corresponding LEAD() takes the value from the row after.  So now it is easy to see that Toledo is only 17,666 behind Cincinnati.

mysql> select name, 
       Population as 'Pop', 
       lag(Population,1) over w as 'lag', 
       lead(Population,1) over w as 'lead', 
       Population - lag(Population) over w as 'diff'  
       from city 
       WHERE District = 'Ohio' 
       window w as (ORDER BY Population desc);
+------------+--------+--------+--------+---------+
| name       | Pop    | lag    | lead   | diff    |
+------------+--------+--------+--------+---------+
| Columbus   | 711470 |   NULL | 478403 |    NULL |
| Cleveland  | 478403 | 711470 | 331285 | -233067 |
| Cincinnati | 331285 | 478403 | 313619 | -147118 |
| Toledo     | 313619 | 331285 | 217074 |  -17666 |
| Akron      | 217074 | 313619 | 166179 |  -96545 |
| Dayton     | 166179 | 217074 |   NULL |  -50895 |
+------------+--------+--------+--------+---------+
6 rows in set (0.00 sec)



Dividing Up Into Buckets


Another useful function is NTILE, which divvies up the data into a given number of groups or buckets.  Be sure to use ORDER BY to ensure the rows you have selected are ordered correctly.  Below, the cities are split into four buckets and then into three buckets.  

mysql> SELECT name, 
              Population as 'Pop', 
              NTILE(4) over w as 'ntile4', 
              NTILE(3) over w as 'ntile3' 
              from city 
              where District = 'Ohio' 
              Window w as (ORDER BY Population desc);
+------------+--------+--------+--------+
| name       | Pop    | ntile4 | ntile3 |
+------------+--------+--------+--------+
| Columbus   | 711470 |      1 |      1 |
| Cleveland  | 478403 |      1 |      1 |
| Cincinnati | 331285 |      2 |      2 |
| Toledo     | 313619 |      2 |      2 |
| Akron      | 217074 |      3 |      3 |
| Dayton     | 166179 |      4 |      3 |
+------------+--------+--------+--------+
6 rows in set (0.00 sec)