Wednesday, August 14, 2019

Improved MySQL Query Performance With InnoDB Mutli Value Indexes

Multi-Valued Indexes are going to change the way you think about using JSON data and the way you architect your data. Before MySQL 8.0.17 you could store data in JSON arrays but trying to search on that data in those embedded arrays was tricky and usually required a full table scan.  But now it is easy and very quick to search and to access the data in JSON arrays.

Multi-Valued Indexes

A Multi-Valued Index (MVI) is a secondary index defined on a column made up of an array of values.  We are all used to traditional indexes where you have one value per index entry, a 1:1 ratio.  A MVI can have multiple records for each index record.  So you can have multiple postal codes, phone numbers, or other attributes from one JSON document indexed for quick access. See Multi-Valued Indexes for details.

For a very simple example, we will create a table. Note the casting of the $.nbr key/values as an unsigned array.

mysql> CREATE TABLE s (id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    -> name CHAR(20) NOT NULL,
    -> j JSON,
    -> INDEX nbrs( (CAST(j->'$.nbr' AS UNSIGNED ARRAY)))
    -> );
Query OK, 0 rows affected (0.11 sec)

Then add in some data. The goal is to have a set of multiple values available under the 'nbr' key where each number in the array represents some enumerated attribute.

mysql> SELECT * FROM s;
+----+-------+---------------------+
| id | name  | j                   |
+----+-------+---------------------+
|  1 | Moe   | {"nbr": [1, 7, 45]} |
|  2 | Larry | {"nbr": [2, 7, 55]} |
|  3 | Curly | {"nbr": [5, 8, 45]} |
|  4 | Shemp | {"nbr": [3, 6, 51]} |
+----+-------+---------------------+
4 rows in set (0.00 sec)

So we want to search on one of the values in the 'nbr' arrays.  Before 8.0.17, you could probably manage with a very elaborate JSON_CONTAINS() or JSON_EXTRACT() calls that have to handle multiple positions in that array.  But with MySQL 8.0.17 you can check to see if a desired value is a member of the array very easily, And there is another new function, MEMBER OF() that can take advantage of MVIs.

mysql>  SELECT * FROM s WHERE 7 MEMBER OF (j->"$.nbr");
+----+-------+---------------------+
| id | name  | j                   |
+----+-------+---------------------+
|  1 | Moe   | {"nbr": [1, 7, 45]} |
|  2 | Larry | {"nbr": [2, 7, 55]} |
+----+-------+---------------------+
2 rows in set (0.00 sec)

So we had two records with the number 7 in the array.  Think abut how many times you have multiple uses of things like postcodes, phone numbers, credit cards , or email addresses tied to a master record. Now you can keep all that within one JSON document and not have to make multiple dives into the data to retrieve that information. Imagine you have a 'build sheet' of a complex product, say a car, and you wanted to be able to quickly find the ones with certain attributes (GPS, tinted windows, and red leather seats).  A MVI give you a way to quickly and efficiently search for these attributes.

And for those curious about the query plan:

mysql> EXPLAIN SELECT * FROM s WHERE 7 MEMBER OF (j->"$.nbr")\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
   partitions: NULL
         type: ref
possible_keys: nbrs
          key: nbrs
      key_len: 9
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

And yes the optimizer handles the new indexes easily. There are some implementation notes below that you will want to familiarize yourself with to make sure you know all the fine points of using MVIs at the end of this blog entry.

A Bigger Example

Lets create a table with one million rows with randomly created data inside a JSON array. Let us use a very simple table with a primary key and a JSON column that will supply the JSON array for the secondary index.

mysql>desc a1;
+-------+------------------+------+-----+---------+----------------+
| Field | Type             | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| id    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| data  | json             | YES  |     | NULL    |                |
+-------+------------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

I wrote a quick PHP script to generate data on STDOUT to a temporary file. And that temporary file was fed in using the MySQL source command.  It is my personal preference to load data this way and probably a bit of a personality quirk but it does allow me to truncate or drop table definitions and re-use the same data.

<?php

for ($x=1; $x < 1000000; $x++) {

$i = rand(1,10000000);
$j = rand(1,10000000);
$k = rand(1,10000000);
echo "INSERT into a1 (id,data) VALUES (NULL,'{\"nbr\":[$i,$j,$k]}');\n";
}
?>

An example line from the file looks like this:


INSERT into a1 (id,data) VALUES (NULL,'{"nbr":[8526189,5951170,68]}');

The  entries in the array should have a pretty large cardinality with ranges between 1 and 10,000,000, especially considering there are only 1,000,000 rows.

Array subscripts in JSON start with a 0 (zero). And remember that the way to get to the third item in the array would be SELECT data->>"$.nbr[2]" for future reference. And is we wanted to check $.nbr[0] to $.nbr[N] we would have to explicitly check each one. Not pretty and expensive to perform.

My benchmark system is an older laptop with an i5 processor with 8k of ram filled with Ubuntu goodness.  So hopefully this would be a worst case scenario for hardware as nobody would run such old & slow gear in production, right (nobody runs gear slow than me, wink-wink nudge-nudge)?  The reason for such antiquated system usage is that comparisons would (or should) so similar gains on a percentage basis.

So lets us start by looking for a $.nbr[0] = 99999.  I added one record with all three elements in the array as five nines to make for a simple example.


*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9718585
     filtered: 100
        Extra: Using where
1 row in set, 1 warning (0.0004 sec)
Note (code 1003): /* select#1 */ select `test`.`a1`.`id` AS `id`,json_unquote(json_extract(`test`.`a1`.`data`,'$.nbr')) AS `data->>"$.nbr"` from `test`.`a1` where (json_unquote(json_extract(`test`.`a1`.`data`,'$.nbr[0]')) = 99999)

And there are no indexes available to be used and it is a full table scan, as indicated in the type: ALL above.  The query runs in about 0.61 seconds.

In the previous example we created the index with the table but this time it is created after the able. And I could have used ALTER TABLE too.

CREATE INDEX data__nbr_idx ON a1( (CAST(data->'$.nbr' AS UNSIGNED ARRAY)) );

So first trial query:

SELECT id, data->>"$.nbr" 
FROM a 
WHERE data->>"$.nbr[2]" = 99999

We have to pick a specific entry in the array as we can not search each item of the array (at least until we can use MVIs). The query runs in about 0.62 seconds, or a fraction slower but close enough for me to say they are the same time.  And EXPLAIN shows this is a full table scan and it does not take advantage of that index just created. So how do we access this new index and take advantage of the MVIs?


New Functions To The Rescue

There are new functions that can take advantage of MVIs when used to the right of the WHERE clause in a query with InnoDB tables. One of those functions is MEMBER OF().

SELECT _id, data->>"$.nbr" 
FROM a1
WHERE 99999 MEMBER OF (data->"$.nbr");

This query runs in 0.001 seconds which is much faster than the previous time of 0.61!  And we are searching all the data in the array not just one slot in the array. So if we do not know if the data we want is in $.nbr[0] or $.nbr[N], we can search all of the array entries easily.  So we are actually looking at more data and at a much faster rate. 

We can also use JSON_CONTAINS() and JSON_OVERLAPS() see Three New JSON Functions in MySQL 8.0.17 fro details.  These three functions are designed to take full advantage of Multi-Value indexes.

SELECT id, data->>"$.nbr" 
FROM a1 
WHERE JSON_CONTAINS(data->'$.nbr',cast('[99999,99999]' as JSON) );

+---------+-----------------------+
| id      | data->>"$.nbr"        |
+---------+-----------------------+
| 1000000 | [99999, 99999, 99999] |
+---------+-----------------------+
1 row in set (0.0013 sec)


SELECT id, data->>"$.nbr"  FROM a1  WHERE JSON_OVERLAPS(data->'$.nbr',cast('[99999,99999]' as JSON) );
+---------+-----------------------+
| id      | data->>"$.nbr"        |
+---------+-----------------------+
| 1000000 | [99999, 99999, 99999] |
+---------+-----------------------+
1 row in set (0.0012 sec)


Fine Points

You can create MVIs with CREATE TABLE, ALTER TABLE, or CREATE INDEX statements, just like any other index.The values are cast as a same-type scalar in a SQL array, A virtual column is transparently generated with all the values of the array and then a functional index is created on the virtual column.

Only one MVI can be used in a composite index. 

You can use MEMBER OF(), JSON_CONTAINS(), or JSON_OVERLAPS() in the WHERE clause to take advantage of MVIs. But once again you can you those three functions on non MVI JSON Data too.

DML for MVIs work like other DMLs for Indexes but you may have more than one insert/updates for a single clustered index record.

Empty arrays are not added to the index so do not try to search for empty values via the index.

MVIs do not support ordering of values so do not use them for primary keys! And no ASC or DSC either!!

And you are limited to 644,335 keys and 10,000 bytes by InnoDB for a single record.  The limit is a single InnoDB undo log page size so you should get up to 1250 integer values.

MVIs can not be used in a foreign key specification.

And check the cardinality of you data.  Having a very narrow range of numbers indexed will not really gain extra performance.