Monday, August 24, 2020

MySQL Comparing INTs and CHARs

     How does MySQL compare INTs with CHARs? I was asked this very question over the weekend and wanted to share that information plus some of the changes in MySQL 8.0.21 in this area.  And yes those changes are pretty big.

Casting

    Computers are good at making comparisons of two values but only if everything else the same.  Comparing an integer with another integer is simple.  Same data with same data type comparisons are a good thing.  But what about when you need to compare a numeric 7 with a "7" where the number is in a string?  In such cases one or both numbers need to be changed into the same basic data type. Imagine your favorite Harry Potter character waving their magic wand and shouting 'accio data' to change two different magical pieces of data into one common data type.  No, Hogwarts was the the reason this conversion is called casting but this 'magic' needs to be made for a good comparison.

    If you read the Optimizer Notes section of the MySQL 8.0.21 Release Notes you will run into a notice that MySQL injects casts to avoid mismatches for numeric and temporal data with string data. The big trick was keeping backward compatibility with previous versions while matching the SQL standard.  Now when the optimizer is comparing numeric and temporal types and the expected data type does not match  it will now add casting operations in the item tree inside expressions and conditions. For instance if you are comparing a YEAR to string they will both be converted to a DOUBLE.

Example

    We have two tables and are comparing an INT to a CHAR.  If we run EXPLAIN ANALYZE of the query we get the details.

explain analyze select * 
from t1 
join t2 on t2.k = t1.k\G

*************************** 1. row ***************************

EXPLAIN: -> Inner hash join (cast(t2.k as double) = cast(t1.k as double))  (cost=4.75 rows=5) (actual time=1.759..1.771 rows=5 loops=1)

    -> Table scan on t1  (cost=0.22 rows=6) (actual time=1.670..1.677 rows=6 loops=1)

    -> Hash

        -> Table scan on t2  (cost=0.75 rows=5) (actual time=0.034..0.043 rows=5 loops=1)

    If we look at the original query we are trying to join two tables where the CHAR t2.k is equal to the INT t1.k.   The magenta highlighted text above shows where both the t2.k and the t1.k columns are cast as doubles.

    Running EXPLAIN without the ANALYZE we can see the query plan's version of the query that has been generated by the optimizer.

explain select * from t1 join t2 on t2.k = t1.k\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5
     filtered: 100
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 6
     filtered: 16.666667938232422
        Extra: Using where; Using join buffer (hash join)
2 rows in set, 1 warning (0.0010 sec)
Note (code 1003): /* select#1 */ select `demo`.`t1`.`id` AS `id`,`demo`.`t1`.`k` AS `k`,`demo`.`t2`.`id` AS `id`,`demo`.`t2`.`k` AS `k` from `demo`.`t1` join `demo`.`t2` where (cast(`demo`.`t2`.`k` as double) = cast(`demo`.`t1`.`k` as double))

    We can see that the original query of select * from t1 join t2 on t2.k = t1.k has been rewritten to select `demo`.`t1`.`id` AS `id`,`demo`.`t1`.`k` AS `k`,`demo`.`t2`.`id` AS `id`,`demo`.`t2`.`k` AS `k` from `demo`.`t1` join `demo`.`t2` where (cast(`demo`.`t2`.`k` as double) = cast(`demo`.`t1`.`k` as double)) by the optimizer.  

    I highly recommend looking at the query plan to help understand what the MySQL server needs to do to make your query work.