Thursday, November 11, 2021

Don't Forget to Check Your Accounts

    The start of fall in the northern hemisphere is a time for many of use to prepare for the approaching winter. Part of that is clearing away and storing summer things and pulling out of storage the winter things. And one of those things I suggest to you do is look at the accounts on your MySQL instances.

    It is very easy, especially with multiple folks creating and alter doing user administration, to have redundant accounts.  You may also have  unneeded accounts.  Or you can have accounts that expired and nobody asked to have them reset.  And you may have accounts that you are not sure who or what uses them.



Start with this simple query to get a first look at your user data.


use mysql;
select User, Host, password_expired, account_locked, User_attributes from user;  

    Examine which accounts are expired and locked.  Do you know why they are in this state? 

    Next look at the User and Host data sorted by User. 

select User, Host from user order by User, Host;
+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| Foo              | %         |
| bar              | localhost |
| bill             | localhost |
| davetest         | localhost |
| demo             | localhost |
| demo2            | localhost |
| dstokes          | %         |
| dstokes          | localhost |
| foo              | localhost |
| foobar           | localhost |
| jack             | localhost |
| mary             | localhost |
| mike             | %         |
| mysql.infoschema | localhost |
| mysql.session    | localhost |
| mysql.sys        | localhost |
| myuser           | %         |
| root             | localhost |
+------------------+-----------+

    In this case notice that there are two separate and distinct accounts where the User column is 'dstokes'.  Someone might have wanted one account or maybe there are two different users to justify the account.  Always look at the account as the pair of the User and Host columns.  If this was supposed to be one account, save yourself some future headaches and merge the accounts.  Remember the MySQL authentication system is a wee bit promisors and lets the first matching account into the server.  This is why dstokes@localhost can not read the data that is granted to dstokes@%! 

    And when you discover that 'dstokes'@'%' is Dave Stokes and dstokes@'localhost is Dilbert Stokes please annotate this information.

ALTER USER 'dstokes'@'%' COMMENT 'This is remote account for Dave Stokes x1234';
ALTER USER 'dstokes'@'localhost' COMMENT 'Dilbert Stokes debug account';

Wednesday, October 27, 2021

Histograms and Faster MySQL Queries

     Histograms were introduced with MySQL 8.0 and are a valuable way of speeding up queries.  The MySQL optimizer assumes that data in a column has evenly distributed values. Even distribution of data probably does not reflect much of the data sitting right now in your database.  

    The optimizer wants to find the most efficient way to return the data requested in a query.  If it has poor information on that data, then the optimizer will make a 'guesstimate' that will will result in a query plan that will not perform well.  But if the optimizer has good information, in this case provided by a histogram, then it can produce a better query plan.

    In the following example a able is filled with data that is not evenly distributed.  In the histogram image following, the data is represented in what looks like a rollercoaster side view. 

create table dist (id serial not null primary key, 
                            x int unsigned not null);

insert into dist (x) value (1),(1),(1),(1),(1),
                                        (2),(3),(3),(3),(3),(3),(3),
                                        (4),(4),(5),(6),(6),(6),(6),
                                        (6),(6),(6),(8),(9),(9),(9),(9);

select x, count(x) from dist group by x;
+---+----------+
| x | count(x) |
+---+----------+
| 1 |        5 |
| 2 |        1 |
| 3 |        6 |
| 4 |        2 |
| 5 |        1 |
| 6 |        7 |
| 8 |        1 |
| 9 |        4 |
+---+----------+


Histogram
    There are 22 values of x that have a value less than seven.  If we examine output of a query where we are looking for the those values, the optimizer estimates, as seen in the EXLAIN output below,  it will need to roughly a third of the 27or 9 rows in the table. Here the optimizer has made a guess from assuming an even distribution, a third of 27 is 9.  It is easy to see that 9 is no where close to 22.


    Imagine a contractor estimates that it will take $9 to make you a widget but the final bill is $22.  Or your GPS application in your phone informs you that you are nine blocks from your destination but in reality is a much longer 22 blocks away.  In these two cases there may be valid reasons for the cost and distance 'overruns' but they are still frustrating to have to come up with the extra money of walk the extra distance.  Likewise this query generates a poorly performing  query plan.

 EXPLAIN select x, count(x) from dist where x < 7\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dist
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 27
     filtered: 33.32999801635742

        Extra: Using where
1 row in set, 1 warning (0.0007 sec)
Note (code 1003): /* select#1 */ select `fk`.`dist`.`x` AS `x`,count(`fk`.`dist`.`x`) AS `count(x)` from `fk`.`dist` where (`fk`.`dist`.`x` < 7)

    In this case a histogram provides a a better query plan. Creating a histogram is easy and in this case ten buckets will be used to store the values.

ANALYZE TABLE dist UPDATE HISTOGRAM ON x WITH 10 BUCKETS;
+---------+-----------+----------+----------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                     |
+---------+-----------+----------+----------------------------------------------+
| fk.dist | histogram | status   | Histogram statistics created for column 'x'. |
+---------+-----------+----------+----------------------------------------------+

    And rerun EXPLAIN.

EXPLAIN select x, count(x) from dist where x < 7\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dist
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 27
     filtered: 81.48148345947266
        Extra: Using where
1 row in set, 1 warning (0.0046 sec)
Note (code 1003): /* select#1 */ select `fk`.`dist`.`x` AS `x`,count(`fk`.`dist`.`x`) AS `count(x)` from `fk`.`dist` where (`fk`.`dist`.`x` < 7)

    81% of 27 is 22 which is the value of the number of rows where x is less than 7.  If the cumulative frequency of the bucket values is examined it is easy to see that the values less than 7 is indeed 81%.

SELECT (SUBSTRING_INDEX(v, ':', -1)) value,        
                concat(round(c*100,1),'%') cumulfreq,             
                CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq     
FROM information_schema.column_statistics,         
            JSON_TABLE(histogram->'$.buckets','$[*]'                 
                COLUMNS(v VARCHAR(60) PATH '$[0]',                    
            c double PATH '$[1]')) hist            
WHERE  table_name = 'dist'  and column_name = 'x';
+-------+-----------+-------+
| value | cumulfreq | freq  |
+-------+-----------+-------+
| 1     | 18.5%     | 18.5% |
| 2     | 22.2%     | 3.7%  |
| 3     | 44.4%     | 22.2% |
| 4     | 51.9%     | 7.4%  |
| 5     | 55.6%     | 3.7%  |
| 6     | 81.5%     | 25.9% |
| 8     | 85.2%     | 3.7%  |
| 9     | 100%      | 14.8% |
+-------+-----------+-------+



    Histograms are great for data that does not change frequently and unlike an index there is no ongoing maintenance overhead to impact performance.  Of course as the data changes, the value of the histogram degrades but they are easily updated.  
    

Friday, October 22, 2021

Does Column Order Matter in MySQL Multi Column Indexes

    Multi column indexes are a powerful way to speed up queries but they are often misunderstood.  In most other databases an index on columns a, b, and c can only be used when searching on columns (a,b,& c), (a & b), and (a)  -- according to the manual. Also that index supposedly can not be used to search for (b & c) or just (c).  Well, that is the way I learned it and the way I have been teaching it. But I was wrong!  Now would be a good time to read the MySQL manual on Multiple-Column Indexes as it does not work as noted (or see the excerpt below) and I assumed MySQL worked the same way as the other databases. Well, it doesn't!

Doubt me?  Well, lets create table and add in some data. 

Table and Data

SQL > create table abcd (a serial auto_increment primary key, b int, c int, d int);

Query OK, 0 rows affected (0.0404 sec)

 SQL > insert into abcd values (null,1,2,3),(null,4,5,6),(null,7,8,9);

Query OK, 3 rows affected (0.0081 sec)

Records: 3  Duplicates: 0  Warnings: 0

SQL > select * from abcd;

+---+---+---+---+
| a | b | c | d |
+---+---+---+---+
| 1 | 1 | 2 | 3 |
| 2 | 4 | 5 | 6 |
| 3 | 7 | 8 | 9 |
+---+---+---+---+

3 rows in set (0.0006 sec)

And then we need the index

SQL > create index bcd_index on abcd(b,c,d);


Testing 

    The first test we use the data from the first row where we look for the three columns (b,c,d) in the order specified in the creation of the index.  And guess what? It works as expected and uses the bcd_index.

SQL > explain format=tree select * from abcd where b=1 and c=2 and d=3\G
*************************** 1. row ***************************
EXPLAIN: -> Covering index lookup on abcd using bcd_index (b=1, c=2, d=3)  (cost=0.35 rows=1)

1 row in set (0.0006 sec)


    Leaving on the last column, searching on (b,c) also works as expected.

SQL > explain format=tree select * from abcd where b=1 and c=2\G
*************************** 1. row ***************************
EXPLAIN: -> Covering index lookup on abcd using bcd_index (b=1, c=2)  (cost=0.35 rows=1)

1 row in set (0.0008 sec)

    As does searching on just the first column (b)

SQL > explain format=tree select * from abcd where b=1\G
*************************** 1. row ***************************
EXPLAIN: -> Covering index lookup on abcd using bcd_index (b=1)  (cost=0.35 rows=1)

1 row in set (0.0006 sec)


    But what if we skip the (c) column, the one in the middle?  Well, I had thought that since (b,c) was not part of (b,c,d) or (b,c) or (b) as defined in the index then it could not use the index.

SQL > explain format=tree select * from abcd where b=1 and d=3\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (abcd.d = 3)  (cost=0.28 rows=0)
    -> Covering index lookup on abcd using bcd_index (b=1)  (cost=0.28 rows=1)

1 row in set (0.0007 sec)

    Well, I thought, maybe TREE format from EXPLAIN was not giving me enough data. So rerun EXPLAIN without TREE.

SQL > explain  select * from abcd where b=1 and d=3\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: abcd
   partitions: NULL
         type: ref
possible_keys: bcd_index
          key: bcd_index
      key_len: 5
          ref: const
         rows: 1
     filtered: 33.333335876464844
        Extra: Using where; Using index

1 row in set, 1 warning (0.0008 sec)
Note (code 1003): /* select#1 */ select `fk`.`abcd`.`a` AS `a`,`fk`.`abcd`.`b` AS `b`,`fk`.`abcd`.`c` AS `c`,`fk`.`abcd`.`d` AS `d` from `fk`.`abcd` where ((`fk`.`abcd`.`d` = 3) and (`fk`.`abcd`.`b` = 1))


    Okay, I know that on a (b,c,d) index that it is not supposed to work with a (d) search. Boy, was I wrong.

SQL > explain format=tree select * from abcd where d=3\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (abcd.d = 3)  (cost=1.30 rows=1)
    -> Index scan on abcd using bcd_index  (cost=1.30 rows=3)

1 row in set (0.0012 sec)

But the Manual!?!

The manual states:
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

MySQL cannot use the index to perform lookups if the columns do not form a leftmost prefix of the index.

Well shoot!  This is a case when there is a difference between what the manual says and what the results are telling me.  Frankly, I find this pretty cool even if it makes me rethink the way I create indexes.   MySQL  is more flexible which gives you better performance

Friday, October 1, 2021

Have an Hour or So And Want To Learn MySQL?

 Want to learn MySQL? Have an hour or so?  Go to Oracle's Learning Explorer and sign up for the FREE MySQL Explorer learning path.  It begins with an overview of MySQL, discusses the client/server model, and then leads you through the use of MySQL.  You do not need previous MySQL or database experience.

MySQL Explorer

MySQL is the most-used opensource database and over the years I have has requests for introductory materials. Well, now I can send them to the MySQL Explorer for some web based training. And you can earn the explorer badge when you pass the end of class quiz.



Monday, September 27, 2021

Dave's Quiz 3 -- Change schemas

     This weeks quiz is a little different.  Many of you grew up with MySQL's sample world database but have not used the world_x database (see https://elephantdolphin.blogspot.com/2021/09/sample-databases-for-learning-mysql-and.html for details) and may be shocked to find that there is now a forth table to the database and some of the information is now stored in a JSON data type column. 

    The quiz is to take the following code that works with the world database and convert it to work with the world_x database.

SELECT city.Name as 'City'
    , city.Population as 'city pop'
    , country.name as 'country'
    , country.Population as 'country pop'
    , round(city.Population / country.Population * 100,2) as 'Percent'  
FROM city 
JOIN country on (city.CountryCode = country.Code) 
ORDER BY Country.Name,  city.name
LIMIT 10;

    The output should look like this:





Hint #1 -- The data has moved in some cases

Hint #2 -- You can use JSON_UNQUOTE(JSON_EXTRACT(...)) or the ->> operator (See https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#function_json-extract and https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#operator_json-inline-path)

There is a solution below and your solution may be different than what is presented.






Solution:

SELECT city.name as 'City'
,city.Info->>"$.Population" as 'city pop'
,countryinfo.doc->>"$.Name" as 'country'
,countryinfo.doc->>"$.demographics.Population" as 'country pop'
,round(city.Info->>"$.Population" / countryinfo.doc->>"$.demographics.Population" * 100,2) as 'Percent'  
FROM city 
JOIN countryinfo on (city.CountryCode = countryinfo.doc->>"$._id") 
ORDER BY countryinfo.doc->>"$.Name",  city.name
LIMIT 10;


Monday, September 20, 2021

Dave's MySQL Quiz Number 2 Answers

    This week's MySQL was a great one for novices looking to level up from beginner level SQL development to a higher level.  

    The problem:  Find the customers with more than one unreturned rented movies that are past their return due date.  You might to look at this for the example of finding an overdue rentals. 

The answer:

First we need to get the customer_id from the customer table.  Then it takes a bit of struggle to get the information on the rental.   It is often easier to write queries by determining the needed output columns, then the 'qualifiers'  or stuff on the right of the WHERE clause before determining what has to be joined to get between the two.

The part of the query to find the overdue entries requires the rental date where it and the length of rental time are before the current date.  r.rental_date + INTERVAL f.rental_duration DAY < CURRENT_DATE()

Finding the those customers with more than one rental needs to have the count of r.rental_id greater than 1. 

So besides the rental table, we will need the inventory table to tie between the rental table and the film table. 


 SQL > select c.customer_id,    
sum(count(r.rental_id)) over (partition by c.customer_id) as'total',  r.rental_id, group_concat(f.title order by f.title) as 'titles' 
from rental r   
join inventory i on (r.inventory_id=i.inventory_id)   
join film f on i.film_id = f.film_id   
join customer c on r.customer_id = c.customer_id   
where r.return_date IS NULL   
AND r.rental_date + INTERVAL f.rental_duration DAY < CURRENT_DATE() 
group by c.customer_id  
having count(r.rental_id) >  1  
order by c.customer_id;
+-------------+-------+-----------+----------------------------------------+
| customer_id | total | rental_id | titles                                 |
+-------------+-------+-----------+----------------------------------------+
|          15 |     2 |     13798 | CANDIDATE PERDITION,SMOKING BARBARELLA |
|          42 |     2 |     13351 | RIVER OUTLAW,TORQUE BOUND              |
|          43 |     2 |     15644 | MOTIONS DETAILS,POLLOCK DELIVERANCE    |
|          53 |     2 |     11657 | LAWLESS VISION,PEACH INNOCENT          |
|          60 |     2 |     12489 | BOOGIE AMELIE,CHAMBER ITALIAN          |
|          75 |     3 |     13534 | LUST LOCK,SLEEPY JAPANESE,TROUBLE DATE |
|         107 |     2 |     13079 | BLADE POLISH,CLUB GRAFFITI             |
|         155 |     2 |     11496 | CHASING FIGHT,HYDE DOCTOR              |
|         163 |     2 |     11754 | HOLES BRANNIGAN,SONS INTERVIEW         |
|         175 |     2 |     13161 | DEER VIRGINIAN,PIRATES ROXANNE         |
|         208 |     2 |     13719 | CURTAIN VIDEOTAPE,SEATTLE EXPECATIONS  |
|         216 |     2 |     11676 | SWEDEN SHINING,WOMEN DORADO            |
|         228 |     2 |     12672 | CYCLONE FAMILY,GRAPES FURY             |
|         267 |     2 |     12066 | LUST LOCK,PHILADELPHIA WIFE            |
|         269 |     2 |     12610 | PRINCESS GIANT,THEORY MERMAID          |
|         284 |     2 |     12064 | BERETS AGENT,FRIDA SLIPPER             |
|         354 |     2 |     11782 | TITANIC BOONDOCK,TROJAN TOMORROW       |
|         361 |     2 |     13298 | HALF OUTFIELD,INSECTS STONE            |
|         448 |     2 |     13577 | FAMILY SWEET,STATE WASTELAND           |
|         457 |     2 |     12645 | CLEOPATRA DEVIL,GLEAMING JAWBREAKER    |
|         516 |     2 |     12130 | FALCON VOLUME,MINORITY KISS            |
|         560 |     2 |     12116 | MOVIE SHAKESPEARE,PIANIST OUTFIELD     |
|         576 |     2 |     11942 | TITANIC BOONDOCK,VANISHED GARDEN       |
+-------------+-------+-----------+----------------------------------------



Bonus:  Add the customer name


select c.customer_id, 
 concat(c.first_name, ' ', c.last_name) AS 'Customer Name',
  sum(count(r.rental_id)) over (partition by c.customer_id) as 'tots',  
 r.rental_id, group_concat(f.title) from rental r 
 join inventory i on (r.inventory_id=i.inventory_id) 
 join film f on i.film_id = f.film_id 
 join customer c on r.customer_id = c.customer_id 
 where r.return_date IS NULL 
 AND r.rental_date + INTERVAL f.rental_duration DAY < CURRENT_DATE() 
 group by c.customer_id
 having count(r.rental_id) >  1
 order by c.customer_id;

Friday, September 17, 2021

Dave's MySQL Quiz #2

     This week's MySQL uses the Sakila database (details on how to get this data) and this week's quiz is a  great one for those wanting to move from beginner level SQL development to a higher level.  There will be lots of tables to joins.

    The problem:  Find the customers with more than one unreturned rented movies that are past their return due date.  You might to look at this for the example of finding an overdue rentals.    You will need to display the customer's ID number, the number of overdue videos, and the names of the videos!  Bonus points for the customer name!

An answer will be posted Monday.