Wednesday, May 24, 2017

Import a JSON Data Set into MySQL

JSON is now a very hot format for sharing data, and MySQL 5.7's JSON data type is now a very hot way of storing that data. But I have had some requests on how to get a raw JSON data set into MySQL. So let's start with the Global Airfields data from the Awesome JSON Datasets collection.

airfields.json

The data is in a file named airfields.json and comes as one very long line. MySQL has no way of taking the individual JSON documents out of that string and putting each one in its own row. The jq tool will let us see the data broken down into objects. Try jq -C '.[]' airfields.json and you will see the individual documents colorized.

{
  "ID": "LFOI",
  "ShortName": "ABBEV",
  "Name": "ABBEVILLE",
  "Region": "FR",
  "ICAO": "LFOI",
  "Flags": 72,
  "Catalog": 0,
  "Length": 1260,
  "Elevation": 67,
  "Runway": "0213",
  "Frequency": 0,
  "Latitude": "N500835",
  "Longitude": "E0014954"
}
{
  "ID": "LFBA",
  "ShortName": "AGENL",
  ....
  "Latitude": "N492658",
  "Longitude": "E0060730"
}

So now we can see the key/value pairs in each object. Let's save that information into a text file, this time using the -M (monochrome) option instead of -C, by typing jq -M '.[]' airfields.json > af.

The airport Table

We will need a simple table, as in CREATE TABLE airport (doc JSON);, to hold the data. The data set is fairly small, so a text editor like vi can be used to turn af into a SQL statement. The first line should read INSERT INTO airport (doc) VALUES ('{. The very last line needs to read }');. The second through last {s need to be changed to ('{, and the }s (save the last one) need to become }'), with a trailing comma to separate the VALUES rows.
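
If you would rather not do the editing by hand, a GNU sed one-liner along these lines does the same rewrite in place (a sketch, assuming jq -M left every top-level { and } on its own line, which is its default pretty-printed output):

# wrap each object in ('{ ... }'), and turn the whole file
# into one multi-row INSERT statement (GNU sed, edits af in place)
sed -i -e "s/^{\$/('{/" \
       -e "s/^}\$/}'),/" \
       -e "1s/^/INSERT INTO airport (doc) VALUES /" \
       -e "\$ s/,\$/;/" af

After that, af is one long INSERT statement, ready to be sourced.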

I used the :s substitute command in vi, but a stream editor such as sed, as sketched above, works just as well. Then it is time to feed the data into the new table: log in to MySQL and type SOURCE af to load the data.

mysql> source af
Query OK, 1218 rows affected (0.16 sec)
Records: 1218  Duplicates: 0  Warnings: 0

mysql> SELECT doc FROM airport LIMIT 1;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| doc                                                                                                                                                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"ID": "LFOI", "ICAO": "LFOI", "Name": "ABBEVILLE", "Flags": 72, "Length": 1260, "Region": "FR", "Runway": "0213", "Catalog": 0, "Latitude": "N500835", "Elevation": 67, "Frequency": 0, "Longitude": "E0014954", "ShortName": "ABBEV"} |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Indexes?

We can easily build indexes on columns built on data in the documents. But what keys exist?

mysql> SELECT JSON_KEYS(doc) FROM airport LIMIT 1;
+------------------------------------------------------------------------------------------------------------------------------------------+
| JSON_KEYS(doc)                                                                                                                           |
+------------------------------------------------------------------------------------------------------------------------------------------+
| ["ID", "ICAO", "Name", "Flags", "Length", "Region", "Runway", "Catalog", "Latitude", "Elevation", "Frequency", "Longitude", "ShortName"] |
+------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

We could index all or a combination of any of these keys. Please note that all the records have the same format, which can be rare for schemaless data. Let's take Region as an example. Region sounds important enough to need indexing. But is it a good candidate for an index? Here we have to peer at the data, and unfortunately every record has 'FR' for a value. Indexing Region does not help pare down the search for a specific record or records.
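
A quick aggregate confirms this (the ->> operator unquotes the extracted value and is available from MySQL 5.7.13 on); on this data set it should return a single row:

mysql> SELECT doc->>"$.Region" AS Region, COUNT(*) AS cnt FROM airport GROUP BY Region;
+--------+------+
| Region | cnt  |
+--------+------+
| FR     | 1218 |
+--------+------+
1 row in set (0.00 sec)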

What about Name? Looking at the data shows that there are no duplicates in that field, and the same goes for Latitude and Longitude. Heck, longitude and latitude are often used together, so we could put both into one index.
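
A query like this should come back empty if Name really is duplicate-free:

mysql> SELECT doc->>"$.Name" AS Name, COUNT(*) AS cnt FROM airport GROUP BY Name HAVING cnt > 1;
Empty set (0.01 sec)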

mysql> ALTER TABLE airport add column Name CHAR(30) GENERATED ALWAYS AS (doc->"$.Name");
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0
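
One thing to watch: the -> operator returns the extracted value as JSON, so the Name column above actually stores the string complete with its surrounding double quotes. To store the bare string instead, wrap the extraction in JSON_UNQUOTE() (or use the ->> operator, available from 5.7.13 on), the same trick the longlat column below uses. The column could then be declared as:

ALTER TABLE airport ADD COLUMN Name CHAR(30) GENERATED ALWAYS AS (JSON_UNQUOTE(doc->"$.Name"));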

mysql> CREATE INDEX nameidx ON  airport (Name);
Query OK, 0 rows affected (0.10 sec)
Records: 0  Duplicates: 0  Warnings: 0

And we can test the index easily.

mysql> explain select Name from airport where Name="ABBEVILLE";
+----+-------------+---------+------------+------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type | table   | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra       |
+----+-------------+---------+------------+------+---------------+---------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | airport | NULL       | ref  | nameidx       | nameidx | 31      | const |    1 |   100.00 | Using index |
+----+-------------+---------+------------+------+---------------+---------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)


And then we add a column combining Longitude and Latitude.

mysql> ALTER TABLE airport add column longlat CHAR(30) GENERATED always  AS (concat(json_unquote(doc->"$.Longitude"),"X",json_unquote(doc->"$.Latitude"))) stored;
Query OK, 1218 rows affected (0.39 sec)
Records: 1218  Duplicates: 0  Warnings: 0
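
With the stored column in place, a coordinate lookup becomes a plain string comparison. Using the ABBEVILLE coordinates from the first record, something like this should find the airfield:

mysql> SELECT doc->>"$.Name" AS Name FROM airport WHERE longlat = "E0014954XN500835";
+-----------+
| Name      |
+-----------+
| ABBEVILLE |
+-----------+
1 row in set (0.00 sec)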

Creating the index on this new column is left as an exercise for the reader (it follows the same pattern as nameidx above).

But now we have a table built from raw JSON-formatted data, with indexes we can use for quick searches.