Friday, August 23, 2019

Check Constraints Issues

Earlier I wrote about check constraints when MySQL 8.0.16 was released. But this week I noticed two different folks having similar problems with them. And sadly, in both cases, it comes down to 'pilot error'.

The first was a question labeled MYSQL 8.0.17 CHECK not working even though it has been implemented, and a cursory glance may make one wonder what is going on with the database.

The table is set up with two constraints. Old timers will probably mutter something under their breath about using ENUMs instead, but here they are:

 JOB_TITLE varchar(20) CHECK(JOB_TITLE IN ('Lecturer', 'Professor', 'Asst. Professor', 'Sr. Lecturer')),  

DEPARTMENT varchar(20) CHECK(DEPARTMENT IN ('Biotechnology', 'Computer Science', 'Nano Technology', 'Information Technology')), 

And if you cut-n-paste the table definition into MySQL Workbench or MySQL Shell, it is perfectly valid DDL. 

So the table is good. 

What about the query?

 ('Sangeet', 'R', 'Sharma', date '1965-11-08', 'M', '11MH456633', 'Prof', 1200900, date '1990-12-16', 120090, 'Computer');

At first glance the query looks good.  But notice the use of 'Prof' instead of 'Professor' and 'Computer' instead of 'Computer Science'.  The two respective constraints are working as they are supposed to. That is why you see the error message ERROR: 3819: Check constraint 'employee_chk_2' is violated.

So how to fix?  Well, you can rewrite the DDL so that 'Prof' and 'Computer' become accepted values.  Or you can make the data match the specifications. If you go to the trouble of adding a constraint, sending it data that does not match the specification just sabotages your own work.
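
A sketch of the second option, assuming the rest of the employee table definition from the question; the only changes are the two flagged values:

INSERT INTO employee VALUES
 ('Sangeet', 'R', 'Sharma', date '1965-11-08', 'M', '11MH456633',
  'Professor', 1200900, date '1990-12-16', 120090, 'Computer Science');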

The Second Issue

In another Stackoverflow post, someone with this table:

CREATE TABLE Test(
  ID CHAR(4),
  CHECK (CHAR_LENGTH(ID) = 4)
);

was wondering why the constraint check seemed to be a problem with INSERT INTO Test(ID) VALUES ('12345');

And the error you get if you try the above? ERROR: 1406: Data too long for column 'ID' at row 1!

Well, this is not a case where a constraint check is behaving badly.  Look at the table definition.  ID is a four (4) CHAR column.  And the length of '12345' is not four! 

Now, in the past MySQL was lax and would truncate that extra character and issue a warning.  And those warnings were often ignored.  MySQL earned a bad reputation for that truncation, and the server's default SQL mode was changed to one that does not allow it. Now the server tells you the data is too long for that column.  The constraint check never comes into play at that stage, because the server sees you trying to shove five characters into a four-character space.

So how to fix? 1) make ID a CHAR(5) and rewrite the constraint (as sketched below), 2) change the SQL mode to allow the server to truncate data, or 3) do not try to put five characters into a space you designed for four.
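
A sketch of option 1, assuming five-character IDs are what you really want. The auto-generated constraint name below is a guess; check yours with SHOW CREATE TABLE:

ALTER TABLE Test
    MODIFY ID CHAR(5),
    DROP CHECK Test_chk_1,
    ADD CHECK (CHAR_LENGTH(ID) = 5);

INSERT INTO Test(ID) VALUES ('12345');  -- now fits and passes the check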

My Gripe

It is frustrating to see something as useful as constraint checks being abused.  And it is frustrating that so many people search the web for answers with keywords but will not look at the manual.  In both of the examples above, five minutes with the manual pages would have saved a lot of effort.

Wednesday, August 14, 2019

Improved MySQL Query Performance With InnoDB Multi-Valued Indexes

Multi-Valued Indexes are going to change the way you think about using JSON data and the way you architect your data. Before MySQL 8.0.17 you could store data in JSON arrays but trying to search on that data in those embedded arrays was tricky and usually required a full table scan.  But now it is easy and very quick to search and to access the data in JSON arrays.

Multi-Valued Indexes

A Multi-Valued Index (MVI) is a secondary index defined on a column made up of an array of values.  We are all used to traditional indexes where you have one index entry per record, a 1:1 ratio.  An MVI can have multiple index records for each data record, an N:1 ratio.  So you can have multiple postal codes, phone numbers, or other attributes from one JSON document indexed for quick access. See Multi-Valued Indexes for details.

For a very simple example, we will create a table. Note the casting of the $.nbr key/values as an unsigned array.

mysql> CREATE TABLE s (
    -> id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    -> name CHAR(20) NOT NULL,
    -> j JSON,
    -> INDEX nbrs( (CAST(j->'$.nbr' AS UNSIGNED ARRAY)))
    -> );
Query OK, 0 rows affected (0.11 sec)

Then add in some data. The goal is to have a set of multiple values available under the 'nbr' key where each number in the array represents some enumerated attribute.
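
The inserts look something like this (the id column is left to auto-increment):

mysql> INSERT INTO s (name, j) VALUES
    -> ('Moe',   '{"nbr": [1, 7, 45]}'),
    -> ('Larry', '{"nbr": [2, 7, 55]}'),
    -> ('Curly', '{"nbr": [5, 8, 45]}'),
    -> ('Shemp', '{"nbr": [3, 6, 51]}');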

mysql> SELECT * FROM s;
| id | name  | j                   |
|  1 | Moe   | {"nbr": [1, 7, 45]} |
|  2 | Larry | {"nbr": [2, 7, 55]} |
|  3 | Curly | {"nbr": [5, 8, 45]} |
|  4 | Shemp | {"nbr": [3, 6, 51]} |
4 rows in set (0.00 sec)

So we want to search on one of the values in the 'nbr' arrays.  Before 8.0.17, you could probably manage with very elaborate JSON_CONTAINS() or JSON_EXTRACT() calls that had to handle every position in that array.  But with MySQL 8.0.17 you can check whether a desired value is a member of the array very easily with the new MEMBER OF() function, which can take advantage of MVIs.

mysql>  SELECT * FROM s WHERE 7 MEMBER OF (j->"$.nbr");
| id | name  | j                   |
|  1 | Moe   | {"nbr": [1, 7, 45]} |
|  2 | Larry | {"nbr": [2, 7, 55]} |
2 rows in set (0.00 sec)

So we had two records with the number 7 in the array.  Think about how many times you have multiple postcodes, phone numbers, credit cards, or email addresses tied to a master record. Now you can keep all of that within one JSON document and not have to make multiple dives into the data to retrieve it. Imagine you have a 'build sheet' for a complex product, say a car, and you want to quickly find the ones with certain attributes (GPS, tinted windows, and red leather seats).  An MVI gives you a way to search for those attributes quickly and efficiently.
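
For instance, if GPS, tinted windows, and red leather seats were coded as attribute numbers (the numbers below are made up to fit the sample data), you could combine MEMBER OF() tests in one query:

SELECT id, name
FROM s
WHERE 7 MEMBER OF (j->'$.nbr')
  AND 45 MEMBER OF (j->'$.nbr');  -- only Moe has both 7 and 45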

And for those curious about the query plan:

mysql> EXPLAIN SELECT * FROM s WHERE 7 MEMBER OF (j->"$.nbr")\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
   partitions: NULL
         type: ref
possible_keys: nbrs
          key: nbrs
      key_len: 9
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

And yes, the optimizer handles the new indexes easily. There are some implementation notes at the end of this blog entry that you will want to familiarize yourself with to make sure you know all the fine points of using MVIs.

A Bigger Example

Let's create a table with one million rows of randomly created data inside a JSON array. We will use a very simple table with a primary key and a JSON column that will supply the JSON array for the secondary index.

mysql> desc a1;
| Field | Type             | Null | Key | Default | Extra          |
| id    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| data  | json             | YES  |     | NULL    |                |
2 rows in set (0.00 sec)
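
For reference, a minimal DDL matching that DESC output would be:

CREATE TABLE a1 (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    data JSON
);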

I wrote a quick PHP script to generate data on STDOUT to a temporary file. And that temporary file was fed in using the MySQL source command.  It is my personal preference to load data this way, probably a bit of a personality quirk, but it does allow me to truncate or drop table definitions and re-use the same data.


<?php
// Generate 999,999 INSERT statements on STDOUT; the record of five
// nines added later rounds the table out to one million rows.
for ($x = 1; $x < 1000000; $x++) {
    $i = rand(1, 10000000);
    $j = rand(1, 10000000);
    $k = rand(1, 10000000);
    echo "INSERT into a1 (id,data) VALUES (NULL,'{\"nbr\":[$i,$j,$k]}');\n";
}

An example line from the file looks like this:

INSERT into a1 (id,data) VALUES (NULL,'{"nbr":[8526189,5951170,68]}');
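
Feeding the file in is then just a matter of sourcing it (the script and file names here are hypothetical):

$ php gen_rows.php > /tmp/data.sql
$ mysql -u root test
mysql> source /tmp/data.sql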

The entries in the array should have pretty high cardinality, with values ranging between 1 and 10,000,000, especially considering there are only 1,000,000 rows.

Array subscripts in JSON start with 0 (zero). And remember, for future reference, that the way to get to the third item in the array is SELECT data->>"$.nbr[2]". And if we wanted to check $.nbr[0] through $.nbr[N], we would have to explicitly check each one. Not pretty, and expensive to perform.
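
Without an MVI, a search across the whole array means spelling out every slot, something like:

-- Only works if you know the array never has more than three entries
SELECT id, data->>"$.nbr"
FROM a1
WHERE data->>"$.nbr[0]" = 99999
   OR data->>"$.nbr[1]" = 99999
   OR data->>"$.nbr[2]" = 99999;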

My benchmark system is an older laptop with an i5 processor and 8GB of RAM, filled with Ubuntu goodness.  So hopefully this is a worst-case scenario for hardware, as nobody would run such old and slow gear in production, right (nobody runs gear slower than mine, wink-wink nudge-nudge)?  The reason for using such an antiquated system is that newer hardware would (or should) show similar gains on a percentage basis.

So let us start by looking for $.nbr[0] = 99999.  I added one record with all three elements in the array set to five nines to make for a simple example.

mysql> EXPLAIN SELECT id, data->>"$.nbr" FROM a1 WHERE data->>"$.nbr[0]" = 99999\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9718585
     filtered: 100
        Extra: Using where
1 row in set, 1 warning (0.0004 sec)
Note (code 1003): /* select#1 */ select `test`.`a1`.`id` AS `id`,json_unquote(json_extract(`test`.`a1`.`data`,'$.nbr')) AS `data->>"$.nbr"` from `test`.`a1` where (json_unquote(json_extract(`test`.`a1`.`data`,'$.nbr[0]')) = 99999)

There are no indexes available to be used, so this is a full table scan, as indicated by the type: ALL above.  The query runs in about 0.61 seconds.

In the previous example we created the index with the table, but this time it is created after the table. I could have used ALTER TABLE too.

CREATE INDEX data__nbr_idx ON a1( (CAST(data->'$.nbr' AS UNSIGNED ARRAY)) );

So here is the first trial query:

SELECT id, data->>"$.nbr" 
FROM a1 
WHERE data->>"$.nbr[2]" = 99999;

We have to pick a specific entry in the array, as we cannot search every item of the array (at least not until we put the MVI to use). The query runs in about 0.62 seconds, a fraction slower but close enough for me to call it the same time.  And EXPLAIN shows this is still a full table scan that does not take advantage of the index just created. So how do we access this new index and take advantage of MVIs?

New Functions To The Rescue

There are new functions that can take advantage of MVIs when used to the right of the WHERE keyword in a query against InnoDB tables. One of those functions is MEMBER OF().

SELECT id, data->>"$.nbr" 
FROM a1 
WHERE 99999 MEMBER OF (data->"$.nbr");

This query runs in 0.001 seconds, which is much faster than the previous time of 0.61 seconds!  And we are searching all the data in the array, not just one slot. So if we do not know whether the data we want is in $.nbr[0] or $.nbr[N], we can search all of the array entries easily.  We are actually looking at more data, and at a much faster rate.

We can also use JSON_CONTAINS() and JSON_OVERLAPS(); see Three New JSON Functions in MySQL 8.0.17 for details.  These three functions are designed to take full advantage of Multi-Valued Indexes.

SELECT id, data->>"$.nbr" 
FROM a1 
WHERE JSON_CONTAINS(data->'$.nbr',cast('[99999,99999]' as JSON) );

| id      | data->>"$.nbr"        |
| 1000000 | [99999, 99999, 99999] |
1 row in set (0.0013 sec)

SELECT id, data->>"$.nbr" 
FROM a1 
WHERE JSON_OVERLAPS(data->'$.nbr',cast('[99999,99999]' as JSON) );
| id      | data->>"$.nbr"        |
| 1000000 | [99999, 99999, 99999] |
1 row in set (0.0012 sec)

Fine Points

You can create MVIs with CREATE TABLE, ALTER TABLE, or CREATE INDEX statements, just like any other index. The values are cast as same-type scalars in a SQL array, a virtual column is transparently generated with all the values of the array, and a functional index is created on that virtual column.
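
For completeness, here is a sketch of the three forms, reusing the a1 table from above (the a2 table is just for illustration):

-- 1) At table creation
CREATE TABLE a2 (id INT PRIMARY KEY, data JSON,
  INDEX nbrs ((CAST(data->'$.nbr' AS UNSIGNED ARRAY))));

-- 2) After the fact with ALTER TABLE
ALTER TABLE a1 ADD INDEX nbrs ((CAST(data->'$.nbr' AS UNSIGNED ARRAY)));

-- 3) Or with CREATE INDEX
CREATE INDEX nbrs ON a1 ((CAST(data->'$.nbr' AS UNSIGNED ARRAY)));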

Only one MVI can be used in a composite index. 

You can use MEMBER OF(), JSON_CONTAINS(), or JSON_OVERLAPS() in the WHERE clause to take advantage of MVIs. But once again, you can use those three functions on non-MVI JSON data too.

DML for MVIs works like DML for other indexes, but you may have more than one insert/update of index records for a single clustered index record.

Empty arrays are not added to the index so do not try to search for empty values via the index.

MVIs do not support ordering of values, so do not use them for primary keys! And no ASC or DESC either!!

And InnoDB limits you to 644,335 keys and 10,000 bytes for a single record.  The limit comes from the size of a single InnoDB undo log page, so you should get up to 1,250 integer values.

MVIs can not be used in a foreign key specification.

And check the cardinality of your data.  Indexing a very narrow range of numbers will not really gain you extra performance.

Friday, August 2, 2019

Parallel Table Importer in MySQL Shell

The ability to import data in parallel is now a feature of the new MySQL Shell.  And you can run it from a command line instead of from within an interactive shell.

How Does It Work?

The importTable utility, new in MySQL Shell 8.0.17, analyzes the input file, divides that file into chunks, and then uploads that data using parallel connections.  This is much faster than a single LOAD DATA statement. You get the ability to define how the data is delimited, as far as how fields and lines are defined.  And it works on DOS CSVs, Unix CSVs, TSVs, and JSON (if that JSON is in one-document-per-line mode). You can also adjust the number of threads, the number of bytes sent per chunk, and the maximum rate of data transfer per thread, so you can balance the load on the network against the speed of data transfer.

The parallel table import utility works only over the classic MySQL protocol and not the newer X Protocol, as X Protocol connections do not support LOAD DATA statements. The parallel table import utility makes use of LOAD DATA LOCAL INFILE statements to upload data chunks from the input file. Make sure that the data file you want to import is in a location that is accessible to the client host as a local disk. And the local_infile system variable must be set to ON on the target server.
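
A quick sanity check, and fix, on the target server looks like this:

mysql> SELECT @@GLOBAL.local_infile;
mysql> SET GLOBAL local_infile = ON;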

What to specify

Besides the name of your data file, you can specify the schema, the table name, an array of column names to map to the table (for cases where your data file does not have all the columns of the table), how duplicates on unique keys are handled, the line termination, the column termination, what fields are enclosed by, the number of threads for uploading, the bytes per chunk, the maximum I/O rate, and the dialect of your data file (CSV, TSV, JSON). And you can get a status report on the progress.
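
From a shell in JavaScript mode, a call spelling out several of those options might look like the sketch below; the file name and values are hypothetical, while the option names come from the importTable utility's documentation:

util.importTable("/tmp/foo.csv", {
    schema: "test",          // target schema
    table: "foo",            // target table
    dialect: "csv-unix",     // Unix-style CSV input
    threads: 4,              // parallel upload connections
    bytesPerChunk: "10M",    // size of each chunk sent
    maxRate: "2M",           // per-thread transfer rate cap
    showProgress: true       // report progress while running
});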

From the Command Line Too

You can use util.importTable() from within an interactive shell or invoke it from the command line.  The following is an import of ten million rows from the command line.

$ mysqlsh mysql://root@localhost --ssl-mode=DISABLED -- util import-table foo.csv --schema=test --table=foo
Importing from file 'foo.csv' to table `test`.`foo` in MySQL Server at /var/lib/mysql/mysql.sock using 3 threads
[Worker000] Records: 1496823  Deleted: 0  Skipped: 0  Warnings: 0
[Worker001] Records: 4204841  Deleted: 0  Skipped: 0  Warnings: 0
[Worker002] Records: 4298336  Deleted: 0  Skipped: 0  Warnings: 0
100% (117.80 MB / 117.80 MB), 490.81 KB/s
File 'foo.csv' (117.80 MB) was imported in 2 min 25.9817 sec at 806.94 KB/s
Total rows affected in Records: 10000000  Deleted: 0  Skipped: 0  Warnings: 0

For details and more examples please see the MySQL Shell Manual.