Open Source Databases: Histograms and Faster MySQL Queries

Histograms were introduced with MySQL 8.0 and are a valuable way of speeding up queries. The MySQL optimizer assumes that data in a column has evenly distributed values. Even distribution of data probably does not reflect much of the data sitting right now in your database.

The optimizer wants to find the most efficient way to return the data requested in a query. If it has poor information on that data, then the optimizer will make a 'guesstimate' that will will result in a query plan that will not perform well. But if the optimizer has good information, in this case provided by a histogram, then it can produce a better query plan.

In the following example a able is filled with data that is not evenly distributed. In the histogram image following, the data is represented in what looks like a rollercoaster side view.

create table dist (id serial not null primary key,
x int unsigned not null);

insert into dist (x) value (1),(1),(1),(1),(1),
                                       (2),(3),(3),(3),(3),(3),(3),
                                       (4),(4),(5),(6),(6),(6),(6),
                                       (6),(6),(6),(8),(9),(9),(9),(9);

select x, count(x) from dist group by x;
+---+----------+
| x | count(x) |
+---+----------+
| 1 | 5 |
| 2 | 1 |
| 3 | 6 |
| 4 | 2 |
| 5 | 1 |
| 6 | 7 |
| 8 | 1 |
| 9 | 4 |
+---+----------+

There are 22 values of x that have a value less than seven. If we examine output of a query where we are looking for the those values, the optimizer estimates, as seen in the EXLAIN output below, it will need to roughly a third of the 27or 9 rows in the table. Here the optimizer has made a guess from assuming an even distribution, a third of 27 is 9. It is easy to see that 9 is no where close to 22.

Imagine a contractor estimates that it will take $9 to make you a widget but the final bill is $22. Or your GPS application in your phone informs you that you are nine blocks from your destination but in reality is a much longer 22 blocks away. In these two cases there may be valid reasons for the cost and distance 'overruns' but they are still frustrating to have to come up with the extra money of walk the extra distance. Likewise this query generates a poorly performing query plan.

EXPLAIN select x, count(x) from dist where x < 7\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: dist
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 27
filtered: 33.32999801635742
Extra: Using where
1 row in set, 1 warning (0.0007 sec)
Note (code 1003): /* select#1 */ select `fk`.`dist`.`x` AS `x`,count(`fk`.`dist`.`x`) AS `count(x)` from `fk`.`dist` where (`fk`.`dist`.`x` < 7)

In this case a histogram provides a a better query plan. Creating a histogram is easy and in this case ten buckets will be used to store the values.

ANALYZE TABLE dist UPDATE HISTOGRAM ON x WITH 10 BUCKETS;

+---------+-----------+----------+----------------------------------------------+

+---------+-----------+----------+----------------------------------------------+

+---------+-----------+----------+----------------------------------------------+

And rerun EXPLAIN.

EXPLAIN select x, count(x) from dist where x < 7\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: dist

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 27

filtered: 81.48148345947266

Extra: Using where

1 row in set, 1 warning (0.0046 sec)

Note (code 1003): /* select#1 */ select `fk`.`dist`.`x` AS `x`,count(`fk`.`dist`.`x`) AS `count(x)` from `fk`.`dist` where (`fk`.`dist`.`x` < 7)

81% of 27 is 22 which is the value of the number of rows where x is less than 7. If the cumulative frequency of the bucket values is examined it is easy to see that the values less than 7 is indeed 81%.

SELECT (SUBSTRING_INDEX(v, ':', -1)) value,

concat(round(c*100,1),'%') cumulfreq,

CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq

FROM information_schema.column_statistics,

JSON_TABLE(histogram->'$.buckets','$[*]'

COLUMNS(v VARCHAR(60) PATH '$[0]',

c double PATH '$[1]')) hist

WHERE table_name = 'dist' and column_name = 'x';

+-------+-----------+-------+

| value | cumulfreq | freq |

+-------+-----------+-------+

| 1 | 18.5% | 18.5% |

| 2 | 22.2% | 3.7% |

| 3 | 44.4% | 22.2% |

| 4 | 51.9% | 7.4% |

| 5 | 55.6% | 3.7% |

| 6 | 81.5% | 25.9% |

| 8 | 85.2% | 3.7% |

| 9 | 100% | 14.8% |

+-------+-----------+-------+

Histograms are great for data that does not change frequently and unlike an index there is no ongoing maintenance overhead to impact performance. Of course as the data changes, the value of the histogram degrades but they are easily updated.

Wednesday, October 27, 2021

Histograms and Faster MySQL Queries