Wednesday, October 27, 2021

Histograms and Faster MySQL Queries

     Histograms were introduced with MySQL 8.0 and are a valuable way of speeding up queries.  The MySQL optimizer assumes that data in a column has evenly distributed values. Even distribution of data probably does not reflect much of the data sitting right now in your database.  

    The optimizer wants to find the most efficient way to return the data requested in a query.  If it has poor information on that data, then the optimizer will make a 'guesstimate' that will will result in a query plan that will not perform well.  But if the optimizer has good information, in this case provided by a histogram, then it can produce a better query plan.

    In the following example a able is filled with data that is not evenly distributed.  In the histogram image following, the data is represented in what looks like a rollercoaster side view. 

create table dist (id serial not null primary key, 
                            x int unsigned not null);

insert into dist (x) value (1),(1),(1),(1),(1),
                                        (2),(3),(3),(3),(3),(3),(3),
                                        (4),(4),(5),(6),(6),(6),(6),
                                        (6),(6),(6),(8),(9),(9),(9),(9);

select x, count(x) from dist group by x;
+---+----------+
| x | count(x) |
+---+----------+
| 1 |        5 |
| 2 |        1 |
| 3 |        6 |
| 4 |        2 |
| 5 |        1 |
| 6 |        7 |
| 8 |        1 |
| 9 |        4 |
+---+----------+


Histogram
    There are 22 values of x that have a value less than seven.  If we examine output of a query where we are looking for the those values, the optimizer estimates, as seen in the EXLAIN output below,  it will need to roughly a third of the 27or 9 rows in the table. Here the optimizer has made a guess from assuming an even distribution, a third of 27 is 9.  It is easy to see that 9 is no where close to 22.


    Imagine a contractor estimates that it will take $9 to make you a widget but the final bill is $22.  Or your GPS application in your phone informs you that you are nine blocks from your destination but in reality is a much longer 22 blocks away.  In these two cases there may be valid reasons for the cost and distance 'overruns' but they are still frustrating to have to come up with the extra money of walk the extra distance.  Likewise this query generates a poorly performing  query plan.

 EXPLAIN select x, count(x) from dist where x < 7\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dist
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 27
     filtered: 33.32999801635742

        Extra: Using where
1 row in set, 1 warning (0.0007 sec)
Note (code 1003): /* select#1 */ select `fk`.`dist`.`x` AS `x`,count(`fk`.`dist`.`x`) AS `count(x)` from `fk`.`dist` where (`fk`.`dist`.`x` < 7)

    In this case a histogram provides a a better query plan. Creating a histogram is easy and in this case ten buckets will be used to store the values.

ANALYZE TABLE dist UPDATE HISTOGRAM ON x WITH 10 BUCKETS;
+---------+-----------+----------+----------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                     |
+---------+-----------+----------+----------------------------------------------+
| fk.dist | histogram | status   | Histogram statistics created for column 'x'. |
+---------+-----------+----------+----------------------------------------------+

    And rerun EXPLAIN.

EXPLAIN select x, count(x) from dist where x < 7\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dist
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 27
     filtered: 81.48148345947266
        Extra: Using where
1 row in set, 1 warning (0.0046 sec)
Note (code 1003): /* select#1 */ select `fk`.`dist`.`x` AS `x`,count(`fk`.`dist`.`x`) AS `count(x)` from `fk`.`dist` where (`fk`.`dist`.`x` < 7)

    81% of 27 is 22 which is the value of the number of rows where x is less than 7.  If the cumulative frequency of the bucket values is examined it is easy to see that the values less than 7 is indeed 81%.

SELECT (SUBSTRING_INDEX(v, ':', -1)) value,        
                concat(round(c*100,1),'%') cumulfreq,             
                CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq     
FROM information_schema.column_statistics,         
            JSON_TABLE(histogram->'$.buckets','$[*]'                 
                COLUMNS(v VARCHAR(60) PATH '$[0]',                    
            c double PATH '$[1]')) hist            
WHERE  table_name = 'dist'  and column_name = 'x';
+-------+-----------+-------+
| value | cumulfreq | freq  |
+-------+-----------+-------+
| 1     | 18.5%     | 18.5% |
| 2     | 22.2%     | 3.7%  |
| 3     | 44.4%     | 22.2% |
| 4     | 51.9%     | 7.4%  |
| 5     | 55.6%     | 3.7%  |
| 6     | 81.5%     | 25.9% |
| 8     | 85.2%     | 3.7%  |
| 9     | 100%      | 14.8% |
+-------+-----------+-------+



    Histograms are great for data that does not change frequently and unlike an index there is no ongoing maintenance overhead to impact performance.  Of course as the data changes, the value of the histogram degrades but they are easily updated.  
    

Friday, October 22, 2021

Does Column Order Matter in MySQL Multi Column Indexes

    Multi column indexes are a powerful way to speed up queries but they are often misunderstood.  In most other databases an index on columns a, b, and c can only be used when searching on columns (a,b,& c), (a & b), and (a)  -- according to the manual. Also that index supposedly can not be used to search for (b & c) or just (c).  Well, that is the way I learned it and the way I have been teaching it. But I was wrong!  Now would be a good time to read the MySQL manual on Multiple-Column Indexes as it does not work as noted (or see the excerpt below) and I assumed MySQL worked the same way as the other databases. Well, it doesn't!

Doubt me?  Well, lets create table and add in some data. 

Table and Data

SQL > create table abcd (a serial auto_increment primary key, b int, c int, d int);

Query OK, 0 rows affected (0.0404 sec)

 SQL > insert into abcd values (null,1,2,3),(null,4,5,6),(null,7,8,9);

Query OK, 3 rows affected (0.0081 sec)

Records: 3  Duplicates: 0  Warnings: 0

SQL > select * from abcd;

+---+---+---+---+
| a | b | c | d |
+---+---+---+---+
| 1 | 1 | 2 | 3 |
| 2 | 4 | 5 | 6 |
| 3 | 7 | 8 | 9 |
+---+---+---+---+

3 rows in set (0.0006 sec)

And then we need the index

SQL > create index bcd_index on abcd(b,c,d);


Testing 

    The first test we use the data from the first row where we look for the three columns (b,c,d) in the order specified in the creation of the index.  And guess what? It works as expected and uses the bcd_index.

SQL > explain format=tree select * from abcd where b=1 and c=2 and d=3\G
*************************** 1. row ***************************
EXPLAIN: -> Covering index lookup on abcd using bcd_index (b=1, c=2, d=3)  (cost=0.35 rows=1)

1 row in set (0.0006 sec)


    Leaving on the last column, searching on (b,c) also works as expected.

SQL > explain format=tree select * from abcd where b=1 and c=2\G
*************************** 1. row ***************************
EXPLAIN: -> Covering index lookup on abcd using bcd_index (b=1, c=2)  (cost=0.35 rows=1)

1 row in set (0.0008 sec)

    As does searching on just the first column (b)

SQL > explain format=tree select * from abcd where b=1\G
*************************** 1. row ***************************
EXPLAIN: -> Covering index lookup on abcd using bcd_index (b=1)  (cost=0.35 rows=1)

1 row in set (0.0006 sec)


    But what if we skip the (c) column, the one in the middle?  Well, I had thought that since (b,c) was not part of (b,c,d) or (b,c) or (b) as defined in the index then it could not use the index.

SQL > explain format=tree select * from abcd where b=1 and d=3\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (abcd.d = 3)  (cost=0.28 rows=0)
    -> Covering index lookup on abcd using bcd_index (b=1)  (cost=0.28 rows=1)

1 row in set (0.0007 sec)

    Well, I thought, maybe TREE format from EXPLAIN was not giving me enough data. So rerun EXPLAIN without TREE.

SQL > explain  select * from abcd where b=1 and d=3\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: abcd
   partitions: NULL
         type: ref
possible_keys: bcd_index
          key: bcd_index
      key_len: 5
          ref: const
         rows: 1
     filtered: 33.333335876464844
        Extra: Using where; Using index

1 row in set, 1 warning (0.0008 sec)
Note (code 1003): /* select#1 */ select `fk`.`abcd`.`a` AS `a`,`fk`.`abcd`.`b` AS `b`,`fk`.`abcd`.`c` AS `c`,`fk`.`abcd`.`d` AS `d` from `fk`.`abcd` where ((`fk`.`abcd`.`d` = 3) and (`fk`.`abcd`.`b` = 1))


    Okay, I know that on a (b,c,d) index that it is not supposed to work with a (d) search. Boy, was I wrong.

SQL > explain format=tree select * from abcd where d=3\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (abcd.d = 3)  (cost=1.30 rows=1)
    -> Index scan on abcd using bcd_index  (cost=1.30 rows=3)

1 row in set (0.0012 sec)

But the Manual!?!

The manual states:
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

MySQL cannot use the index to perform lookups if the columns do not form a leftmost prefix of the index.

Well shoot!  This is a case when there is a difference between what the manual says and what the results are telling me.  Frankly, I find this pretty cool even if it makes me rethink the way I create indexes.   MySQL  is more flexible which gives you better performance

Friday, October 1, 2021

Have an Hour or So And Want To Learn MySQL?

 Want to learn MySQL? Have an hour or so?  Go to Oracle's Learning Explorer and sign up for the FREE MySQL Explorer learning path.  It begins with an overview of MySQL, discusses the client/server model, and then leads you through the use of MySQL.  You do not need previous MySQL or database experience.

MySQL Explorer

MySQL is the most-used opensource database and over the years I have has requests for introductory materials. Well, now I can send them to the MySQL Explorer for some web based training. And you can earn the explorer badge when you pass the end of class quiz.



Monday, September 27, 2021

Dave's Quiz 3 -- Change schemas

     This weeks quiz is a little different.  Many of you grew up with MySQL's sample world database but have not used the world_x database (see https://elephantdolphin.blogspot.com/2021/09/sample-databases-for-learning-mysql-and.html for details) and may be shocked to find that there is now a forth table to the database and some of the information is now stored in a JSON data type column. 

    The quiz is to take the following code that works with the world database and convert it to work with the world_x database.

SELECT city.Name as 'City'
    , city.Population as 'city pop'
    , country.name as 'country'
    , country.Population as 'country pop'
    , round(city.Population / country.Population * 100,2) as 'Percent'  
FROM city 
JOIN country on (city.CountryCode = country.Code) 
ORDER BY Country.Name,  city.name
LIMIT 10;

    The output should look like this:





Hint #1 -- The data has moved in some cases

Hint #2 -- You can use JSON_UNQUOTE(JSON_EXTRACT(...)) or the ->> operator (See https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#function_json-extract and https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#operator_json-inline-path)

There is a solution below and your solution may be different than what is presented.






Solution:

SELECT city.name as 'City'
,city.Info->>"$.Population" as 'city pop'
,countryinfo.doc->>"$.Name" as 'country'
,countryinfo.doc->>"$.demographics.Population" as 'country pop'
,round(city.Info->>"$.Population" / countryinfo.doc->>"$.demographics.Population" * 100,2) as 'Percent'  
FROM city 
JOIN countryinfo on (city.CountryCode = countryinfo.doc->>"$._id") 
ORDER BY countryinfo.doc->>"$.Name",  city.name
LIMIT 10;


Monday, September 20, 2021

Dave's MySQL Quiz Number 2 Answers

    This week's MySQL was a great one for novices looking to level up from beginner level SQL development to a higher level.  

    The problem:  Find the customers with more than one unreturned rented movies that are past their return due date.  You might to look at this for the example of finding an overdue rentals. 

The answer:

First we need to get the customer_id from the customer table.  Then it takes a bit of struggle to get the information on the rental.   It is often easier to write queries by determining the needed output columns, then the 'qualifiers'  or stuff on the right of the WHERE clause before determining what has to be joined to get between the two.

The part of the query to find the overdue entries requires the rental date where it and the length of rental time are before the current date.  r.rental_date + INTERVAL f.rental_duration DAY < CURRENT_DATE()

Finding the those customers with more than one rental needs to have the count of r.rental_id greater than 1. 

So besides the rental table, we will need the inventory table to tie between the rental table and the film table. 


 SQL > select c.customer_id,    
sum(count(r.rental_id)) over (partition by c.customer_id) as'total',  r.rental_id, group_concat(f.title order by f.title) as 'titles' 
from rental r   
join inventory i on (r.inventory_id=i.inventory_id)   
join film f on i.film_id = f.film_id   
join customer c on r.customer_id = c.customer_id   
where r.return_date IS NULL   
AND r.rental_date + INTERVAL f.rental_duration DAY < CURRENT_DATE() 
group by c.customer_id  
having count(r.rental_id) >  1  
order by c.customer_id;
+-------------+-------+-----------+----------------------------------------+
| customer_id | total | rental_id | titles                                 |
+-------------+-------+-----------+----------------------------------------+
|          15 |     2 |     13798 | CANDIDATE PERDITION,SMOKING BARBARELLA |
|          42 |     2 |     13351 | RIVER OUTLAW,TORQUE BOUND              |
|          43 |     2 |     15644 | MOTIONS DETAILS,POLLOCK DELIVERANCE    |
|          53 |     2 |     11657 | LAWLESS VISION,PEACH INNOCENT          |
|          60 |     2 |     12489 | BOOGIE AMELIE,CHAMBER ITALIAN          |
|          75 |     3 |     13534 | LUST LOCK,SLEEPY JAPANESE,TROUBLE DATE |
|         107 |     2 |     13079 | BLADE POLISH,CLUB GRAFFITI             |
|         155 |     2 |     11496 | CHASING FIGHT,HYDE DOCTOR              |
|         163 |     2 |     11754 | HOLES BRANNIGAN,SONS INTERVIEW         |
|         175 |     2 |     13161 | DEER VIRGINIAN,PIRATES ROXANNE         |
|         208 |     2 |     13719 | CURTAIN VIDEOTAPE,SEATTLE EXPECATIONS  |
|         216 |     2 |     11676 | SWEDEN SHINING,WOMEN DORADO            |
|         228 |     2 |     12672 | CYCLONE FAMILY,GRAPES FURY             |
|         267 |     2 |     12066 | LUST LOCK,PHILADELPHIA WIFE            |
|         269 |     2 |     12610 | PRINCESS GIANT,THEORY MERMAID          |
|         284 |     2 |     12064 | BERETS AGENT,FRIDA SLIPPER             |
|         354 |     2 |     11782 | TITANIC BOONDOCK,TROJAN TOMORROW       |
|         361 |     2 |     13298 | HALF OUTFIELD,INSECTS STONE            |
|         448 |     2 |     13577 | FAMILY SWEET,STATE WASTELAND           |
|         457 |     2 |     12645 | CLEOPATRA DEVIL,GLEAMING JAWBREAKER    |
|         516 |     2 |     12130 | FALCON VOLUME,MINORITY KISS            |
|         560 |     2 |     12116 | MOVIE SHAKESPEARE,PIANIST OUTFIELD     |
|         576 |     2 |     11942 | TITANIC BOONDOCK,VANISHED GARDEN       |
+-------------+-------+-----------+----------------------------------------



Bonus:  Add the customer name


select c.customer_id, 
 concat(c.first_name, ' ', c.last_name) AS 'Customer Name',
  sum(count(r.rental_id)) over (partition by c.customer_id) as 'tots',  
 r.rental_id, group_concat(f.title) from rental r 
 join inventory i on (r.inventory_id=i.inventory_id) 
 join film f on i.film_id = f.film_id 
 join customer c on r.customer_id = c.customer_id 
 where r.return_date IS NULL 
 AND r.rental_date + INTERVAL f.rental_duration DAY < CURRENT_DATE() 
 group by c.customer_id
 having count(r.rental_id) >  1
 order by c.customer_id;

Friday, September 17, 2021

Dave's MySQL Quiz #2

     This week's MySQL uses the Sakila database (details on how to get this data) and this week's quiz is a  great one for those wanting to move from beginner level SQL development to a higher level.  There will be lots of tables to joins.

    The problem:  Find the customers with more than one unreturned rented movies that are past their return due date.  You might to look at this for the example of finding an overdue rentals.    You will need to display the customer's ID number, the number of overdue videos, and the names of the videos!  Bonus points for the customer name!

An answer will be posted Monday.

Monday, September 13, 2021

Quiz #1 Answers

Here are answers to Quiz #1 and please note there are several ways to solve just about any problem so your answer may be different.

QUIZ #1:   How would you modify the following query on the world database to find only the official languages of each country?  Bonus - list all the official languages for each country on one row.

select country.name,
       countrylanguage.Language
from country
join countrylanguage 
    on country.code = countrylanguage.CountryCode);


First step is to see what in the countrylanguage table provides us on the official language status.

 SQL > desc countrylanguage;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| CountryCode | char(3)       | NO   | PRI |         |       |
| Language    | char(30)      | NO   | PRI |         |       |
| IsOfficial  | enum('T','F') | NO   |     | F       |       |
| Percentage  | float(4,1)    | NO   |     | 0.0     |       |
+-------------+---------------+------+-----+---------+-------+

For those not used to dealing with enums they let enumerate a list of allowed values. Is this case IsOfficial can can either be 'T' or 'F'.  

I prefer to use alias on table names and below I use c for the country table and l for the countrylanguage table. Why? It improves readability and saves typing.    For extra karma points if you are a novice, rewrite the above query with alias.

The above query needs to qualify the query so that we receive only the rows where the column IsOfficial is equal to 'T'.  

select c.name,
       l.Language
from country c         
join countrylanguage l
     on  (c.code = l.CountryCode) 
where  l.IsOfficial = 'T';

Bonus

Listing all the official languages for each country on one row is a little more complex. First we have to GROUP BY each country.   If you simply add GROUP BY c.name you will find that you only get one official language in the cases there there  are many.    You also have to use GROUP_CONCAT() to provide all the possible official languages.  I also added aliases for the selected column names.

select c.name as 'Country',           
       GROUP_CONCAT(l.Language) as 'Official Languages'
from country c         
join countrylanguage l 
     on(c.code = l.CountryCode) 
where  l.IsOfficial = 'T' 
GROUP BY c.name;

The GROUP_CONCAT() adds the commas between each language automatically and you can change the separator as you desire.

But if you peer through the results there is one thing that may catch you eye that could be cleaned up for readability.  Take a look at the entry for South Africa where the languages are list but are not in alphabetical order.  Here is where you can add ORDER BY to the GROUP_CONCAT() to fix that issues.

Final code:


select c.name as 'Country',           
         GROUP_CONCAT(l.Language order by l.language) as 'Offical Langauges' 
from country c         
join countrylanguage l 
       on  (c.code = l.CountryCode) 
where  l.IsOfficial = 'T' 
GROUP BY c.name;




Thursday, September 9, 2021

Sample Databases for Learning MySQL and a QUIZ!!

     I frequently see requests on various web sites from MySQL novices looking for example databases with which to practice.  There are several that are highly recommended and we will take a look at them over the next few weeks.

1. World Data Base -- See https://dev.mysql.com/doc/world-setup/en/

This has been THE database used in MySQL's documentation, classes, examples, and just about anything else.  There are three tables -- city, country, and countrylanguage -- that are fantastic for teaching the basics such as JOINs, GROUP BY, HAVING, and other SQL commands.  

1x. World X Database -- See https://dev.mysql.com/doc/world-x-setup/en/

With the advent of the X DevAPI there needed to be a test database with JSON columns and the World Database was modified.  The countryinfo table was added to the inherited three so that you can practice with JSON columns and a JSON document collection.

3. Sakila -- See https://dev.mysql.com/doc/sakila/en/

Once upon a time, if you wanted to see a movie in your home you had to first go to a store filled with BETAMAX and VHS format tapes and rent them.  The Sakila database is data representing the operations of such a store.  This is a bigger set of data than the previous two and provides twenty three tables giving novices an opportunity to join more tables together.  

4. Northwind -- Various sites

Northwind is a fictitious trading company and the sample dataset came from Microsoft. There are various version available for MySQL and they feature ten  tables. 

In the coming weeks I will work with various data sets hopefully showing novices how to use these test databases to expand their knowledge.  And if it proves popular, there will be a quiz each week featuring one of the above.

QUIZ:  How would you modify the following query on the world database to find only the official languages of each country?  Bonus - list all the official languages for each country on one row.

select country.name,
          countrylanguage.Language
from country
         join countrylanguage on
               (country.code = countrylanguage.CountryCode);


Friday, September 3, 2021

Southern California Linux Expo 19x MySQL Track

    There will be a MySQL tract at the Southern California Linux Expo 19x (SCaLE)  March 3-6, 2022 in Pasadena, California.   I will organizing the track and the first part of that is asking you to submit talks for the MySQL track.  We needs talks for novices, we need talks for the advanced, and we need talks for those in between!  November 20th, 2021 is the deadline for abstracts/proposals submissions.  

    SCaLE is the largest Linux show in Southern California and not only has amazing talks on anything open source but the expo hall is not to be missed.  Pasadena is an amazing city with great food within walking distance of the expo hall.  Plus they have been recording America's Got Talent next door - please note Simon Cowell's knowledge of shell scripting is suspect so double check any code he offers you.


Seriously, this is an amazing event.  I need 8-10 hours of presentations!  If you have questions, need help with your proposal, or have questions about SCaLE please email me - David.Stokes at Oracle.com   -- I will help you get your proposal ready!

If you have not been to SCaLE before, I urge you to attend as the price is beyond reasonable, the crowd is amazing, and the presentations are fantastic.  And I hope one of the talks in the MySQL Track is yours!

Tuesday, August 31, 2021

The MySQL Shell and the MySQL Config Editor

    One of the things I love about the MySQL Shell is that is saves your authentication credentials (no longer just passwords) in an obfuscated login path file named .mylogin.cnf. You can find that file in the %APPDATA%\MySQL directory on Windows or the current user's home directory on non-Windows systems. The file can be read later by MySQL client programs, such as the MySQL Shell, to obtain your authentication credentials for connecting to MySQL Server.

Which Credentials?

I had an email this morning from someone wanting to a) change the password stored by mysqlsh and b) find out which credentials are stored on the shell account being used.

Let's start out in reverse order.  The mysql_config_editor will show you the accounts with credentials saved for that login. 

$ mysql_config_editor print --all
[root@localhost]
user = "root"
password = *****
host = "localhost"
[demouser@localhost]
user = "demouser"
password = *****
host = "localhost"

The two accounts on the shell account being used are root@localhost and demouser@localhost.

Quick Password Reset

To reset the server password on the second account use ALTER USER 'demouser'@'localhost' IDENTIFIED BY '<newpasswordhere>';  

If you login again from mysqlsh and the the stored password does not work, you will be prompted for the new password.  And that new password will be encrypted in stores in the .mylogin.cnf file. 

Start From Scratch

Is you want remove all the stored authentication information then type mysql_config_editor reset and all will be removed. 


Monday, August 30, 2021

Novices and Their SQL Misconceptions

     Do yourself a favor and spend ten minutes watching this video on the CER2021 paper Identifying SQL Misconceptions of Novices: Findings from a Think-Aloud study.  Structured Query Language is hard for many new programmers as it is a declarative language which makes it much different that the object orientated or procedural languages with which they started their software journeys. 

    The authors have categories problems in four general misconceptions programmers have -- Previous Knowledge, Generalization, Language, and Bad Mental Model. Those are then refined more and I am sure anyone who has to fix poor queries will immediately recognize. 

    Why novices should read this:  Structured Query Language or SQL is a lot different than other languages and its logic is designed around data sets.  The assumptions you bring in from other languages can hinder your developing SQL skills.   The syntax and keywords are not what you are used to and frankly things about data groups instead of one row of data can be hard for some to master.  This video and the paper will help you hurdle those knowledge gaps.

    Why SQL professional should read this:  After you have mastered SQL, it is easy to forget the points that you stumbled over learning it.  it will also help you identify where your journals flounder so that hopefully you can aid them.

A few times each week I run across someone asking why SQL is 'so weird', where is the 'for....next' or while (...) flow control, or why you can not have a table of tables pointing to other tables.  MySQL is a very easy database to learn but there is still a learning curve and trying to make it act like other software many not work.

So a big round of applause to the authors : Daphne Miedema, Efthimia Aivaloglou and George Fletcher. 

Tuesday, August 24, 2021

PHP MySQL Connectors -- MySQLi, PDO, and/or X DevAPI

 Help!

I am preparing a presentation for the Longhorn PHP conference titled PHP & MySQL -- How do PDO, MySQLi, and X DevAPI do what they do.  This will be a comparison of using PHP with different MySQL Connectors.  As far as I can tell there are no three way comparisons of mysqli, PDO, and the X DevAPI PECL extension.  And much of the PDO versus myqli stuff looks to me like it has not aged well. 

I have good material in the raw presentation about overall features, parameters for prepared queries. And a good section on the how's and whys on prepared queries.  

But what else would you like to see in such a presentation?  I have read some postings on the web about turning off buffering (fairly simple to do).  But what else are would you like to see compared?

Performance?  Well, my rudimentary and preliminary benchmarks do not really so much of a difference running on the same platform against the data.  But do you know a case where one shines over the others?

Coding style?  Are you more object orientated or procedural?  So you like the you can use the X Protocol to skip having to write structured query language.

So if you have any input on what you would like to see in this presentation PLEASE send it to me!

Thursday, August 19, 2021

EXPLAINing the Different EXPLAINS In MySQL

     The main tool for tuning MySQL queries is the EXPLAIN statement (https://dev.mysql.com/doc/refman/8.0/en/explain.html) and one of the hurdles on the learning curve of the EXPLAIN statement is explaining what EXPLAIN is explaining.  Basically EXPLAIN is prepended to a SELECT, TABLE, DELETE, UPDATE, INSERT or REPLACE statement.  To add to an already steep learning curve is that there are many types of EXPLAIN in MySQL.

Let Me Explain

It is very simple to add EXPLAIN at the very beginning of a query to see how the server wants to execute a query. 

EXPLAIN SELECT col1, col2 FROM my_table;

The output will show the query plan (the actual query the optimizer will instruct the MySQL server to run) and some preliminary information about how the query plan was picked among the many possible options.  Learning to use EXPLAIN to tune queries is a long process and beyond the scope of this simple article.   This article is about the various EXPLAIN variants available with MySQL 8.0.

Starting with a relatively simple query using the tried and true world database that MySQL has used forever for examples, we will join two tables.

SQL > SELECT city.name as 'City',
             country.name as 'country' 
           from city 
           join country on (city.countrycode = country.code) 
           where district = 'Texas';
+----------------+---------------+
| City           | country       |
+----------------+---------------+
| Houston        | United States |
| Dallas         | United States |
| San Antonio    | United States |
| Austin         | United States |
| El Paso        | United States |
| Fort Worth     | United States |
| Arlington      | United States |
| Corpus Christi | United States |
| Plano          | United States |
| Garland        | United States |
| Lubbock        | United States |
| Irving         | United States |
| Laredo         | United States |
| Amarillo       | United States |
| Brownsville    | United States |
| Pasadena       | United States |
| Grand Prairie  | United States |
| Mesquite       | United States |
| Abilene        | United States |
| Beaumont       | United States |
| Waco           | United States |
| Carrollton     | United States |
| McAllen        | United States |
| Wichita Falls  | United States |
| Midland        | United States |
| Odessa         | United States |
+----------------+---------------+
26 rows in set (0.0017 sec)

Prepending EXPLAIN to the query gives us the following:


SQL > EXPLAIN SELECT city.name as 'City', 
                     country.name as 'country' 
          from city j
            on country on (city.countrycode = country.code) 
         where district = 'Texas'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: city
   partitions: NULL
         type: ALL
possible_keys: CountryCode
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4188
     filtered: 10
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: country
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: world.city.CountryCode
         rows: 1
     filtered: 100
        Extra: NULL
2 rows in set, 1 warning (0.0006 sec)
Note (code 1003): /* select#1 */ select `world`.`city`.`Name` AS `City`,`world`.`country`.`Name` AS `country` from `world`.`city` join `world`.`country` where ((`world`.`country`.`Code` = `world`.`city`.`CountryCode`) and (`world`.`city`.`District` = 'Texas'))

By the way, that last line that starts with ‘Note’ is the query plan or the query the server will run to return the data requested in the query.  You may notice that the query plan’s query is the optimized version of the original query and in this case it looks a bit different.

As mentioned earlier, the peering into of the entrails of the output is past the scope of this article. But the output above may overwhelm a novice to those using EXPLAIN but there is, believe it or not, some information missing.  We can modify the EXPLAIN statement with a change in output format.  

TREE Format

FORMAT=TREE provides information on the cost of the queries. What is that important? MySQL and most other relational databases use a cost based optimizer that evaluates the many choices to retrieve the data requested in a query.

Also the TREE format is a little more verbose in some areas.  In the previous example EXPLAIN does tell those familiar with the terse out that a table scan is performed on the city table while the TREE format explicitly reports this fact. So TREE format is a little easier to read but does not have as much detail as the ‘traditional’ version.

SQL > EXPLAIN FORMAT=TREE SELECT city.name as 'City', country.name as 'country' from city join country on (city.countrycode = country.code) where district = 'Texas'\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join  (cost=822.91 rows=419)
    -> Filter: (city.District = 'Texas')  (cost=425.05 rows=419)
        -> Table scan on city  (cost=425.05 rows=4188)
    -> Single-row index lookup on country using PRIMARY (Code=city.CountryCode)  (cost=0.85 rows=1)
1 row in set (0.0006 sec)

JSON Format

But what if you need more information, even more than the traditional option. Well, there is the JSON format. If you want details, this format is for you!

SQL > EXPLAIN FORMAT=JSON SELECT city.name as 'City', country.name as 'country' from city join country on (city.countrycode = country.code) where district = 'Texas'\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "822.91"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "city",
          "access_type": "ALL",
          "possible_keys": [
            "CountryCode"
          ],
          "rows_examined_per_scan": 4188,
          "rows_produced_per_join": 418,
          "filtered": "10.00",
          "cost_info": {
            "read_cost": "383.17",
            "eval_cost": "41.88",
            "prefix_cost": "425.05",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "Name",
            "CountryCode",
            "District"
          ],
          "attached_condition": "(`world`.`city`.`District` = 'Texas')"
        }
      },
      {
        "table": {
          "table_name": "country",
          "access_type": "eq_ref",
          "possible_keys": [
            "PRIMARY"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "Code"
          ],
          "key_length": "3",
          "ref": [
            "world.city.CountryCode"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 418,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "355.98",
            "eval_cost": "41.88",
            "prefix_cost": "822.91",
            "data_read_per_join": "107K"
          },
          "used_columns": [
            "Code",
            "Name"
          ]
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.0010 sec)
Note (code 1003): /* select#1 */ select `world`.`city`.`Name` AS `City`,`world`.`country`.`Name` AS `country` from `world`.`city` join `world`.`country` where ((`world`.`country`.`Code` = `world`.`city`.`CountryCode`) and (`world`.`city`.`District` = 'Texas'))

VISUAL Explain

Well the tree options shown so far are not exactly esthetically pleasing but they are useful for tuning queries.  But there is another EXPLAIN that is available in MySQL Workbench to provide yet another look at how the query works. VISUAL Explain provides much of the information from the other EXPLAINs but in a visually pleasing format.  And it is a great addition to documentation on queries.  

 











Estimated Performance versus Real Numbers

Now you know the difference between the various EXPLAIN formats available in MySQL 8.0.  But all those numbers and information are based on estimates from the server statistics.  Sadly those statistics may be old or skewed by changes to the database. The query itself is not run and the optimizer is only using historical data which may not be accurate.  If you need real statistics there is an ability to add ANALYZE to any EXPLAIN statement to have the query run to see the actual performance numbers.

SQL > EXPLAIN ANALYZE FORMAT=TREE SELECT city.name as 'City', country.name as 'country' from city join country on (city.countrycode = country.code) where district = 'Texas'\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join  (cost=822.91 rows=419) (actual time=1.611..1.734 rows=26 loops=1)
    -> Filter: (city.District = 'Texas')  (cost=425.05 rows=419) (actual time=1.580..1.692 rows=26 loops=1)
        -> Table scan on city  (cost=425.05 rows=4188) (actual time=0.051..1.265 rows=4079 loops=1)
    -> Single-row index lookup on country using PRIMARY (Code=city.CountryCode)  (cost=0.85 rows=1) (actual time=0.001..0.001 rows=1 loops=26)
1 row in set (0.0023 sec)

Hopefully you can now explain the various EXPLAIN output formats and as you work to tune your queries. Each has their use in query tuning and making your MySQL server more responsive.


JavaScript and Friends this Friday

 JavaScript and Friends starts tomorrow and I will be speaking on JavaScript and the MySQL Shell.  JavaScript is one of the three modes of input for mysqlsh! 

Plus you see how to register reports with the shell so they will be ready when you are.

Tickets range from $0 to $75.



Friday, August 6, 2021

Finding Your Hidden InnoDB Primary Indexes

     InnoDB is the default storage engine for MySQL and InnoDB prefers that there is a PRIMARY KEY defined when a table is defined.  Having a unique non-nullable primary key can vastly speed up queries and data is stored by the primary key in a B+ Tree structure.  

    What if a primary key is not defined, InnoDB will use the first unique key defined as NOT NULL.  Failing that, InnoDB generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column that contains a 6-byte automatically increasing number when rows are inserted. This is a key that you can not use for searches (it is hidden from you!) and is not directly benefitting you. And that is probably not what you want.

    To find those columns you need to look in the INFORMATION_SCHEMA with a query like this:

SELECT i.TABLE_ID,
       t.NAME
FROM INNODB_INDEXES i
JOIN
     INNODB_TABLES t ON (i.TABLE_ID = t.TABLE_ID)
WHERE
    i.NAME='GEN_CLUST_INDEX';

    Run this on your instance and see if there are any surprises where you thought you had a primary key but in reality you do not.

+----------+----------------------+
| TABLE_ID | NAME                 |
+----------+----------------------+
|     1204 | json/t1              |
|     1206 | so/testfloat         |
|     1209 | so/tab1              |
|     1210 | so/q                 |
|     1211 | so/j                 |
|     1238 | sunshine/gentest     |
|     1368 | testx/austin         |
|     1374 | gis/geometries       |
|     1376 | gis/xy               |
|     1377 | gis/geom             |
|     1386 | gis/worldmap         |
|     1381 | gis/geometry_columns |
|     1407 | gis/mypoint          |
|     1409 | gis/zipcode          |
|     1410 | zip/zipcode          |
+----------+----------------------+
15 rows in set (0.0060 sec)


    Can you leave them like this, with the hidden index? Yup, sure you can. If the columns with the name GEN_CLUST_INDEX are there then the server has been making some use of them.

    Or you can run ALTER TABLE and designate an existing column as a primary key or to add a new column for that purpose.

Or add a invisible column to be the new primary key if you have any doubts about not bothering existing queries that would be bothered by the sudden appearance of a new column in the data.

Thursday, August 5, 2021

Work at MySQL

 I was asked today what opportunities Oracle and MySQL had for new college graduates.  I referred them to Start Your Success Story as the place to start. Oracle is a big company and always on the lookout for talent.  The list of internships & programs is extensive. 

But  specifically what about MySQL?  For someone who is not a recent graduate?


The Oracle Job Search page is a great place to start.  If you enter the keyword 'MySQL' today, you will find over ninety jobs listed in jobs ranging from Quality Analysis, software engineers, sales, technical writers and more.  

If you don't see an exact match for your skilled, send in your resume or CV as the hiring managers are aggressively looking for skilled individuals and the web site might be just a smidgeon out of date.

Tuesday, August 3, 2021

Writing Data From the Pandas Data Analysis Library To MySQL

    In the last installment the flexible and easy to use open source data analysis and manipulation tool     Pandas, built on top of the Python programming language, was used with the MySQL Shell, aka mysqlsh.  Pandas is a very popular open source project that features data structures and tools to manipulate numerical tables.  And the name is derived from Panel Data and not the bamboo chewing critters. 

    Tasks like generating a series is simple:

dstokes@localhost pandas]$ python
Python 3.9.6 (default, Jul 16 2021, 00:00:00) 
[GCC 11.1.1 20210531 (Red Hat 11.1.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1, 3, 5, 11, 29, 42])
>>> s
0     1
1     3
2     5
3    11
4    29
5    42
dtype: int64
>>> 

    Data frame is a table presentation concept that is easy to use.  Below some data on various guitars is presented:

>>> df = pd.DataFrame(
...  { "Brand": 
    [ "Fender Telecaster", "Gretsch 5420", "Gibson Flying-V"],
...    "color": [ "blonde", "black", "natural"],
... }
... )
>>> df
               Brand    color
0  Fender Telecaster   blonde
1       Gretsch 5420    black
2    Gibson Flying-V  natural
>>> 

    Hmm, does the above data frame definition look a bit like JSON to anyone besides me?  I will have to experiment with loading and saving from MySQL's JSON data type.  In a future installment I will dig a bit more deeply.

But How Do You Save Data??

    The last blog on Pandas generated some questions about saving data from Pandas into MySQL.  

    In the example below the data from a series is saved to a table.  This time we are not using mysqlsh and instead uses a overly simple script that uses the MySQL Python connector.   

    Numpy's random number generator is combined with Panda's one dimensional array class  s = pd.Series(np.random.randn(1)) to produce a numeric value.

#!/usr/bin/python
import pandas as pd
import numpy as np
import mysql.connector
import random
import string

#Generate a random letter
randomLetter = random.choice(string.ascii_letters)

#Generate a random number
s = pd.Series(np.random.randn(1))

try:
  mydb = mysql.connector.connect(host="localhost", 
         database="bear", user="panda", passwd="123Bear!")
  mycursor = mydb.cursor()
  sql = "INSERT INTO t1 (myletter, myvalue) VALUES (%s, %s)"
  val = [randomLetter ,  str(s[0])]
  mycursor.execute(sql,val)
  mydb.commit()

except Exception as e:
  mydb.close()
  print(str(e))

Now there are other connectors but my preference is the MySQL Python Connector for the reasons that it works exceedingly well and in simple to use.  Now this is a very simple example and yes the code works within a script and within mysqlsh.  

Tuesday, July 20, 2021

Pandas and The MySQL Shell

    Pandas is a Python software library for data analysis and manipulation.  It allows you to import data from CSV files, SQL, and Excel.  And it is open source.

    Pandas works well with the MySQL Shell.  After installing Pandas with pip install pandas on my Fedora test system, I was able to load Pandas.  The documentation is excellent at pandas.pydata.org   and I was quickly able to get one of their examples working from within mysqlsh.


Did I mention SQL as a source of data? Reading from MySQL is very easy.  


import mysql.connector as connection
import pandas as pd

try:
    mydb = connection.connect(host="localhost", database = 'world',user="panda", passwd="123Bear!")
    query = "SELECT * FROM country LIMIT 5;"
    result_dataFrame = pd.read_sql(query,mydb)
    mydb.close() #close the connection

except Exception as e:
    mydb.close()
    print(str(e))

    I will investigate Pandas more and report anything I find.




MySQL 8.0.26 And Interesting Items in the Release Notes

You may have noticed that MySQL 8.0.26 was released today (yea!)  and I recommend reading the release notes but here are some of the highlights.

  • TLS version 1 and 1.1 are deprecated. Please use later versions such as 1.2 and 1.3 (you made need OpenSSL 1.1.1 or higher too).
  • More information in the server log on client timeouts includes the timeout value, and client user and host when that information is available
  • Internal functions to copy values between columns are improved and test at about11% faster (YMMV)
There are a lot of interesting bits of information in the release notes and they show the amount of work that the MySQL Engineering Teams put into a release to give you a better product.

But What Are Those Warnings?

But some of you may be seeing the following messages:

The syntax 'sync_master_info' is deprecated and will be removed in a future release. Please use sync_source_info instead.

'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged with strict mode in a future release.

What does that mean?

The first warning is part of a continuing process of replacing offensive terms (see MySQL Terminology Updates) and the new syntax is sync_source_info, replacing sync_master_info. These changes are much more involved than a simple global search and replace in the source code.  According to the release notes most of the changes have been made along with the corresponding help text.

The second is a warning about upcoming changes in STRICT MODE (stricter mode?) adding settings.

From the MySQL Manual

The default SQL mode in MySQL 8.0 includes these modes: ONLY_FULL_GROUP_BYSTRICT_TRANS_TABLESNO_ZERO_IN_DATENO_ZERO_DATEERROR_FOR_DIVISION_BY_ZERO, and NO_ENGINE_SUBSTITUTION.

And also peek at Strict SQL Mode 

And as of 8.0.26 you can set your session to strict mode only if you the privileges' to set restricted variables. 

As usual you can download from https://www.mysql.com/downloads/

Tuesday, July 13, 2021

Do You Really Need to Upgrade from MySQL Native Password to Caching SHA256 Password Authentication?

    If you have not read MDS, PHP and authentication by my brilliant colleague LeFred please do so. The TL;DR is that most recent versions of the PHP support the MySQL 8.0 standard for authentication, Caching SHA 256, but with PHP 7.3 and earlier you will have to use the older MySQL Native Password authentication.

    Someone reached out to me with a direct message to ask me if they really needed to change from MySQL native to Caching SHA 256.  

    Well, what versions of MySQL are you using?  The good news is that they are running 8.0.23 in production, '25 in test and development. But all the accounts are using MySQL Native Authentication.

    What versions of PHP?  There was some hesitation before this question was answered.  Dev is 8.0.8, test is a mixture 8.0.8, 7.4.21, and 7.4.21, and production 7.4.21. With one little exception of 7.4.20 that they are trying to move away from, they promised, by the end of the month.

    Do we have to re-do all the accounts to access MySQL to this new method?  Well, it depends.   In this case they had several projects using for the most part unique schemas. And each had separate usernames & authentication strings -- they were not using the MySQL root account for everything thing.  Passwords were rotated a few times each year.  

    No direct access from the internet to the instances?  Well, there is one application that directly connects through a dedicated port on the firewall. Everything else that touches the MySQL instances are within the firewall.  And nobody really seems to check on just who is using that connection or audits activity on that direct connection.  

    The Caching SHA 256 method is much more secure hashing method.  It is required to use either a TLS connection or an unencrypted connection that supports password exchange using an RSA key pair.  Please see SHA-256 Pluggable Authentication. By default MySQL 8.0 sets up secure connections between the clients and servers.

    I asked if there was anything in their data that was sensitive, proprietary, covered by privacy laws, or anything they just did not want made public. Well, yes there was.   

Do They Need to Upgrade?

    I told this person, no, you do not need to immediately upgrade from MySQL native to SHA 256 but I would suggest that they plan to upgrade over the next several months.  Other than being paranoid about the one direct connection through the firewall, it sounded like they were keeping on top of things.  

    Being paranoid about the security of your data should be the default mode of and DBA or developer.  Going with a lower standard than the MySQL 8.0 defaults means you are actively downgrading your protection.  

    I recommend reading Protecting MySQL Passwords With the sha256_password Plugin from the MySQL Server Team Blog for more technical details on using SHA 256 authentication.