Friday, February 26, 2021

Searching JSON Data

 A post on the MySQL forums caught my eye today. The distilled question was why was an index not being used to search JSON data.  The poster (in blue) had done most of the steps needed almost correctly but just missed the crucial steps. 

Suppose we have this table with an index on a virtual generated column as JSON attribute index pattern

```
CREATE TABLE applications (
id BINARY(16) PRIMARY KEY,
data JSON NOT NULL,
) ENGINE=InnoDB;
CREATE INDEX idx ON applications ((CAST(data->>"$.name" AS CHAR(10))));

 

One of the frustrations for novices with databases is the combined richness of Structured Query Language or SQL and the many options available with MySQL which can cause problems.  In this case the index was created from a generated data but is it really a generated column?

For queries with `=` operator to compare the text field of data, the index is hit and works as expected

```

mysql> EXPLAIN SELECT * FROM applications FORCE INDEX (idx) WHERE CAST(data->>'$.name' AS CHAR(10)) = 'app-1';

+----+----+----+-------+------+---------------+------+---------+-------+------+----------+-------+

| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra 
+----+-----+-------+---+------+---------------+------+---------+-------+------+----------+-------+

| 1 | SIMPLE | applications | NULL | ref | idx | idx | 13 | const | 1 | 100.00 | NULL |

+----+----+----+-------+----
--+---------------+------+---------+-------+------+----------+-------+

1 row in set, 1 warning (0.00 sec)

So the index works as expected.  But there is trouble on the horizon.  And by the way, I do not like to use FORCE INDEX.

But for queries with `LIKE` operator, even in case search without wildcard, the indexes is not hit

```

mysql> EXPLAIN SELECT * FROM applications FORCE INDEX (idx) WHERE CAST(data->>'$.name' AS CHAR(10)) LIKE 'app-1';

+----+--+-----+---+------+---------------+------+---------+------+------+----------+-------------+

| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+----+----+----+-------+------+----------+------+---------+------+------+----------+-------------+

| 1 | SIMPLE | applications | NULL | ALL | NULL | NULL | NULL | NULL | 6 | 100.00 | Using where |

+----+---+------+-+------+---------------+------+---------+------+------+----------+-------------+

1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT * FROM applications FORCE INDEX (idx) WHERE CAST(data->>'$.name' AS CHAR(10)) LIKE 'app-%';

+----+---+------+--+------+----------+------+---------+------+------+----------+-------------+

| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+----+----+-----+---+------+-------------+------+---------+------+------+----------+-------------+

| 1 | SIMPLE | applications | NULL | ALL | NULL | NULL | NULL | NULL | 6 | 100.00 | Using where |

+----+------------+------+----+------+------+------+------+------+------+----------+-------------+

1 row in set, 1 warning (0.00 sec)


So why would LIKE cause a problem? And is there a work around?  Some of you may be wondering if this is a case for using Full Text Search.  

Digging Deeper

First I wanted to 'trust but verify' the examples.  I have learned the hard way that sometimes that the example code can be wrong or omit something critical. 

 create table ap1 (id serial, 
              data JSON not null);

 create index ap1_idx on ap1 ((cast(data->>"$.name" AS CHAR(10))));

I created some test data and ran the query wondering if I would get the same result. The equal operator worked fine but not the like with a wildcard.  

explain  select * from ap1 where cast(data->>'$.name' as char(10))='app-1'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ap1
   partitions: NULL
         type: ref
possible_keys: ap1_idx
          key: ap1_idx
      key_len: 43
          ref: const
         rows: 1
     filtered: 100
        Extra: NULL
 explain  select * from ap1 where cast(data->>'$.name' as char(10)) like 'app-%'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ap1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
     filtered: 100
        Extra: Using where
1 row in set, 1 warning (0.0010 sec)

A query with a an equal sign worked but not with the wildcard. Well dang! It does not work. Just as the original poster described.

The thing nagging at the back of my mind was the index being set up as CREATE INDEX idx ON applications cast(json_unquote(json_extract(`data`,_utf8mb4\'$.name\')) as char(10) charset utf8mb4). Sometimes you can get a 'code smell' where you can not point out exactly what is wrong but you know somethings is not right.

But is there a better way?

The index as originally presented bothered me.  So I decided to use a generated column and then index that column.  Logically this is two steps and not one.  But that original code for the index bothered me enough to make those two steps.
 
SQL>create table app (id serial, 
         doc JSON, 
         extracted_name char(20) 
              generated always as (json_unquote(doc->"$.name")), 
         index i (extracted_name)
);

Query OK, 0 rows affected (0.0479 sec)

SQL > explain select * from app where extracted_name like 'app-%' \G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: app
   partitions: NULL
         type: range
possible_keys: i
          key: i
      key_len: 81
          ref: NULL
         rows: 3
     filtered: 100
        Extra: Using where
1 row in set, 1 warning (0.0010 sec)
SQL > select * from app where extracted_name like 'app-%' \G
*************************** 1. row ***************************
            id: 2
           doc: {"name": "app-1"}
extracted_name: app-1
*************************** 2. row ***************************
            id: 1
           doc: {"name": "app-2"}
extracted_name: app-2
*************************** 3. row ***************************
            id: 3
           doc: {"name": "app-3"}
extracted_name: app-3
3 rows in set (0.0007 sec)

  

This provides the ability to search with wildcards. I will dig into the troublesome index but wanted to share this work around.