Monday, May 20, 2019

Structuring Your Unstructured JSON data

The world seems awash in unstructured, NoSQL data, mainly of the JSON variety.  While this has a great many benefits as far as data mutability and not being locked into a rigid structure there are some things missing that are common in the structured world of SQL databases.

What if there was a way to take this unstructured NoSQL JSON data and cast it, temporarily, into a structured table?  Then you could use all the processing functions and features found in a relation database on you data.  There is a way and it is the JSON_TABLE function.

JSON_TABLE


You can find the documentation for JSON_TABLE here  but there are some examples below that may make learning this valuable function easier than the simple RTFM.

I will be using the world_x dataset for the next example

If we run a simple SELECT JSON_PRETTY(doc) FROM countryinfo LIMIT 1;  the server will return something similar to the following:

{
  "GNP": 828,
  "_id": "ABW",
  "Name": "Aruba",
  "IndepYear": null,
  "geography": {
    "Region": "Caribbean",
    "Continent": "North America",
    "SurfaceArea": 193
  },
  "government": {
    "HeadOfState": "Beatrix",
    "GovernmentForm": "Nonmetropolitan Territory of The Netherlands"
  },
  "demographics": {
    "Population": 103000,
    "LifeExpectancy": 78.4000015258789
  }
}


We can use JSON_TABLE to extract the Name, the Head of State, and the Governmental Form easily with the following query. If you are not used to the MySQL JSON Data type, the "$" references the entire document in the doc column (and doc is out JSON data type column in the table).  And notice that the $.government.HeadOfState and $.government.GovernmentForm are the full path to the keys in the document.

select jt.* FROM countryinfo,
json_table(doc, "$" COLUMNS (
          name varchar(20) PATH "$.Name",
          hofstate varchar(20) PATH '$.government.HeadOfState',
          gform varchar(50) PATH '$.government.GovernmentForm')) as jt
limit 1;

The syntax is JSON_TABLE(expr, path COLUMNS (column_list) [AS] alias) where expr is either a column column from a table or a JSON document passed to the function ('{"Name": "Dave"}' as an example).  Then the desired columns are specified where we name the new column, give it a relational type, and then specify the path of the JSON values we want to cast.

And the results are in the form of a relational table.

+-------+----------+----------------------------------------------+
| name  | hofstate | gform                                        |
+-------+----------+----------------------------------------------+
| Aruba | Beatrix  | Nonmetropolitan Territory of The Netherlands |
+-------+----------+----------------------------------------------+

This is JSON_TABLE in its most basic form.  The only thing I would like to emphasize is that the keys of the JSON data are case sensitive and it is import to check your spelling!

Data Problems

There is also a nice feature to JSON_TABLE where you assign a default value if that key/value pair is missing or yet another value if there is something can not be cast. If we use a non-existent key/value pair here named 'xyz' for an example, we can insert the value '888' for any JSON document missing values.

select jt.* FROM countryinfo,
json_table(doc, "$" COLUMNS (
          name varchar(20) PATH "$.Name",
          hofstate varchar(20) PATH '$.government.HeadOfState',
          xyz int(4) PATH '$.xyz' DEFAULT '999' ON ERROR DEFAULT '888' ON EMPTY,
          gform varchar(50) PATH '$.government.GovernmentForm')) as jt
limit 1;

And how the result looks:

+-------+----------+-----+----------------------------------------------+
| name  | hofstate | xyz | gform                                        |
+-------+----------+-----+----------------------------------------------+
| Aruba | Beatrix  | 888 | Nonmetropolitan Territory of The Netherlands |
+-------+----------+-----+----------------------------------------------+


NULL Handling


Now be careful with Null values. If you change the new line to  xyz int(4) PATH '$.IndepYear' DEFAULT '999' ON ERROR DEFAULT '888' ON EMPTY, we can easily see that the NULL value for Aruba's year of independence will return the default '999' value.  And if you change the path to '$.Name' to try and force the string value into the integer it will trake the ON ERROR path.

And you can assign missing values to NULL 

Nested Path Data


Iterating nested arrays can be painful but JSON_TABLE makes it very simple.  So creating some dummy data, we can start work on digging through the nested information.

select * from a;
+----+-----------------------+
| id | x                     |
+----+-----------------------+
|  1 | {"a": 1, "b": [1, 2]} |
|  2 | {"a": 2, "b": [3, 4]} |
|  3 | {"a": 3, "b": [5, 6]} |

+----+-----------------------+

The query features the NESTED PATH argument

select d.* FROM a, 
JSON_TABLE(x, "$" columns 
        (mya varchar(50) PATH "$.a", 
NESTED PATH "$.b[*]" 
                columns (myb int path '$'))
) as d;

The output.

+-----+-----+
| mya | myb |
+-----+-----+
| 1   |   1 |
| 1   |   2 |
| 2   |   3 |
| 2   |   4 |
| 3   |   5 |
| 3   |   6 |
+-----+-----+
6 rows in set (0.0013 sec)

Not bad but lets add another level.

select * from b;
+----+-----------------------------------------------------+
| id | x                                                   |
+----+-----------------------------------------------------+
|  1 | {"a": 2, "b": [{"c": 101, "d": [44, 55, 66]}]}      |
|  2 | {"a": 1, "b": [{"c": 100, "d": [11, 22, 33]}]}      |
|  3 | {"a": 3, "b": [{"c": 102, "d": [77, 88, 99, 101]}]} |
+----+-----------------------------------------------------+
3 rows in set (0.0009 sec)

So lets embed another level

select d.* FROM b,  
       JSON_TABLE(x, "$" columns          
        (mya varchar(50) PATH "$.a",  
          NESTED PATH "$.b[*]"          
           columns (myc int path '$.c', 
           NESTED PATH '$.d[*]'  
           columns (dpath int path '$'))) 
) as d order by myc;
+-----+-----+-------+
| mya | myc | dpath |
+-----+-----+-------+
| 1   | 100 |    22 |
| 1   | 100 |    33 |
| 1   | 100 |    11 |
| 2   | 101 |    44 |
| 2   | 101 |    55 |
| 2   | 101 |    66 |
| 3   | 102 |    77 |
| 3   | 102 |    88 |
| 3   | 102 |    99 |
| 3   | 102 |   101 |
+-----+-----+-------+
10 rows in set (0.0006 sec)

And we can get ordinal numbers too.

select d.* FROM b,  
         JSON_TABLE(x, "$" columns          
        (mya varchar(50) PATH "$.a",  
           NESTED PATH "$.b[*]"          
           columns (myc int path '$.c', 
           nested path '$.d[*]'  
           columns (dcount for ordinality
           dpath int path '$'))) ) as d 
order by dpath;
+-----+-----+--------+-------+
| mya | myc | dcount | dpath |
+-----+-----+--------+-------+
| 1   | 100 |      1 |    11 |
| 1   | 100 |      2 |    22 |
| 1   | 100 |      3 |    33 |
| 2   | 101 |      1 |    44 |
| 2   | 101 |      2 |    55 |
| 2   | 101 |      3 |    66 |
| 3   | 102 |      1 |    77 |
| 3   | 102 |      2 |    88 |
| 3   | 102 |      3 |    99 |
| 3   | 102 |      4 |   101 |
+-----+-----+--------+-------+
10 rows in set (0.0009 sec)

And not that we have the data structured, we can start using the WHERE clause such as  where myc > 100 and dpath < 100.