Open Source Databases: March 2021

Monday, March 29, 2021

QUEST Insync & Open Source 101

Not one but two virtual presentations on March 30th for me.

First up at 10:45 Central is MySQL Indexes, Histograms, and Other Ways To Speed Up Queries; look for Session ID: 101290.

Second at 3:45 Central is An Introduction to MySQL Indexes & Histograms (a shorter version of the first talk) and I will be staffing the virtual booth.

Slides will be posted at https://slideshare.net/davestokes

So if you really, really want more information on indexes and histograms then March 30th is your day!

Wednesday, March 24, 2021

From Spreadsheet to Database with MySQL Workbench

In the last post I covered some of the many reasons to migrate data from a spreadsheet to a database and now it is time to show how to do just that. Moving data from a spreadsheet into MySQL can actually be pretty simple in many cases. Now I will not mislead you by telling you that all cases are easy but there is an easy way to make this migration. And I highly recommend the following process when possible as it can save lots of time. More painful migrations will be covered in the future but this post is my favorite 'fast and easy' process.

This is a guide to taking data from a spreadsheet and easily moving that data into a MySQL database instance. The trick is the Import Wizard in MySQL Workbench that does a lot of the detail work for you. In a future post we will go into what you have to do when you are not using Workbench and have many more steps to take to get the job done.

You will find the data for this example at https://catalog.data.gov/dataset/texas-dot-txdot-work-zone-data and this dataset provides lane closure occurrences within the Texas Department of Transportation (TxDOT) highway system in a tabular format. By the way, Texans love to argue about who has the best BBQ and freeway construction. A continuously updated archive of the TxDOT WZDx feed data can be found at the ITS WorkZone Raw Data Sandbox and the ITS WorkZone Semi-Processed Data Sandbox. The live feed is currently compliant with the Work Zone Data Exchange (WZDx) Specification version 2.0. But most importantly for this blog, it is a good example dataset. So please download it and import it into the spreadsheet of your choice.

1. Export Data

There are many ways to convert the data from a spreadsheet to something MySQL can consume. The easiest is to save it as CSV, or Comma Separated Values, where every column of data has a comma between it and the next column. In Sheets you will use File->Download->CSV, in Excel it is File->Save As->CSV, and other spreadsheet software will have something similar. Please note that for most programs this exports the current sheet and not the entire spreadsheet.

road_event_feed_info_feed_update_date,road_event_feed_info_version,road_event_id,subidentifier,road_name,road_number,direction,beginning_cross_street,ending_cross_street,beginning_milepost,ending_milepost,beginning_accuracy,ending_accuracy,start_date,end_date,start_date_accuracy,end_date_accuracy,event_status,total_num_lanes,vehicle_impact,workers_present,reduced_speed_limit,restrictions,description,issuing_organization,creation_date,update_date,types_of_work,lanes,geometry_linestring,geometry_multipoint

03/17/2021 04:58:33 PM,2,ELP_2019-08-01_00014+2021-03-17,ELP,SH-20,,eastbound,US-54,,,,Estimated,Estimated,03/17/2021 12:00:00 AM,03/17/2021 11:59:59 PM,Estimated,Estimated,,3,some-lanes-closed,,,[],2019 - 08/01 On-Going Closure - Right lane closure,TxDOT,,,[],"[{""lane_edge_reference"": ""left"", ""lane_number"": 1, ""lane_status"": ""open"", ""lane_type"": ""left-lane""}, {""lane_edge_reference"": ""left"", ""lane_number"": 2, ""lane_status"": ""open"", ""lane_type"": ""middle-lane""}, {""lane_edge_reference"": ""left"", ""lane_number"": 3, ""lane_status"": ""closed"", ""lane_type"": ""right-lane""}]",,MULTIPOINT ((-106.445503 31.773454))

The data above shows only the first two rows of the file. The first row has the column headings and the second row is sample data. Workbench can read the first row and use those headings to create the column names for the table in the database. It will read the other lines in the file to make an educated guess on the data types for the columns of the table.
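To see what a CSV reader makes of a row like the one above, here is a minimal Python sketch using only the standard library. The sample is an abbreviated stand-in for the real export (five of the columns, with a shortened lanes value), and it is not how Workbench itself parses the file.

```python
import csv
import io
import json

# Abbreviated sample: a few columns from the TxDOT export, including
# the JSON "lanes" column with its doubled quotes.
sample = (
    'road_event_id,road_name,direction,total_num_lanes,lanes\n'
    'ELP_2019-08-01_00014+2021-03-17,SH-20,eastbound,3,'
    '"[{""lane_number"": 3, ""lane_status"": ""closed""}]"\n'
)

reader = csv.DictReader(io.StringIO(sample))
columns = reader.fieldnames   # column names come from the first row
rows = list(reader)           # every later row becomes a dict

# The csv module undoes the doubled quotes, leaving valid JSON text.
lanes = json.loads(rows[0]["lanes"])

print(columns)
print(lanes[0]["lane_status"])
```

The quoted lanes cell contains commas of its own, which is why the surrounding double quotes matter: without them the row would split into too many columns.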

2. Import Data

Start MySQL Workbench and log in to your instance.

Open a new SQL tab by clicking on the icon indicated below.

Issue a CREATE DATABASE texdot; command in that new SQL tab. Do not forget the semicolon at the end of the line that indicates the end of the Structured Query Language (SQL) command.

Next execute that create database command by hitting the second lightning bolt icon. The second lightning bolt icon executes the SQL under the cursor.

You will see that the command has executed with the 1 row(s) affected message.

Now we have to tell Workbench that we want to use that new database we just created. No, that does not automatically happen. You have to select the little 'eyeball' icon on the schema list to reload the list of available schemas (the terms schema and database are used interchangeably in the MySQL world).

Now scroll down in the schemas window until you see your new database. Click on 'texdot' to tell Workbench that this is the database we want to use. When you click on it you will see the active schema name underneath in green.

Then select the right pointing arrowhead to expand the information on the schema.

Right click on the Tables line and you will see the Table Data Import Wizard option.

Now comes the neat stuff.

Enter the name of the file where the CSV data resides.

In my case it was in D:\tex\ and I was able to browse for it.

We want to create a new table in the texdot database.

And I was able to enter the name zone for the new table.

The wizard starts to work its magic. It takes the data in the first row of the CSV file and converts it to column names. It then looks at the rest of the data file to determine the data type of each column.

You will note that you can change them if you want, and that it will automatically make a best guess for types like JSON. Select Next when you are ready to proceed.
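The idea of guessing a column type from its values can be sketched in a few lines of Python. This is only an illustration of the general approach: the function name guess_type, the order of the checks, and the date format are my assumptions, not the wizard's actual rules.

```python
import json
from datetime import datetime

def guess_type(values):
    """Return a MySQL-style type name that fits every non-empty value."""
    present = [v for v in values if v != ""]
    if not present:
        return "TEXT"

    def all_match(check):
        # True only if the check accepts every value without raising.
        for v in present:
            try:
                check(v)
            except ValueError:
                return False
        return True

    def as_json(v):
        # Only treat objects and arrays as JSON; bare numbers are not.
        if not v.startswith(("[", "{")):
            raise ValueError(v)
        json.loads(v)

    def as_datetime(v):
        # Assumed format, matching values like 03/17/2021 04:58:33 PM.
        datetime.strptime(v, "%m/%d/%Y %I:%M:%S %p")

    if all_match(int):
        return "INT"
    if all_match(float):
        return "DOUBLE"
    if all_match(as_datetime):
        return "DATETIME"
    if all_match(as_json):
        return "JSON"
    return "TEXT"

print(guess_type(["3", "2"]))
print(guess_type(["03/17/2021 04:58:33 PM"]))
print(guess_type(["[]", '[{"lane_number": 1}]']))
```

A single stray value is enough to push a column down to TEXT, which is a good reason to review the guessed types before selecting Next.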

You are now ready to import. Select Show Logs from the lower left corner and then select Next.

The data is now uploaded.

The time to import is partially a function of the CSV file length but eventually you will get the Import data file finished message. Again select Next.

There are some statistics on the performance of the import that may be of use. In this case we imported just over 1600 records. Select Finished.
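One quick sanity check is to count the records in the CSV file yourself and compare that number with what the wizard reports. A minimal sketch, assuming a helper I am calling count_records; it uses the csv module rather than counting lines so that a quoted cell containing a line break still counts as one record.

```python
import csv
import io

def count_records(csv_file):
    """Count data records in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)                      # skip the column headings
    return sum(1 for row in reader if row)  # ignore completely blank lines

# Small in-memory stand-in for a real file; the second record's
# description spans two physical lines.
sample = 'id,description\n1,"two\nlines"\n2,plain\n\n'
print(count_records(io.StringIO(sample)))
```

For a file on disk, open it with newline="" as the csv documentation recommends, for example: with open(path, newline="") as f: count_records(f).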

Hit that eyeball icon again to reload the table list.

And we can now use SQL to query the data. Enter SELECT * FROM texas_dzone LIMIT 5; and again select that second lightning bolt icon. (Please note that in an earlier run I used the name 'zone' instead of 'texas_dzone'.)

And the results of that query.

In just a few moments the data went from being in a spreadsheet to being in a database. No muss, no fuss. In the future we will look at datasets that take a little more 'engineering' to get into the database as there are cases where we and not the wizard have to do the 'heavy lifting'.

More on MySQL Workbench

For more on MySQL Workbench please see https://www.mysql.com/products/workbench/ to download and https://dev.mysql.com/doc/workbench/en/ for the documentation.

Tuesday, March 16, 2021

Is a Database Just a Fancy Spreadsheet?!?

Want to make a data person groan? Ask them if a database is just a fancy spreadsheet. Then ask them to convert a spreadsheet to a database. Add in that there are lots of funny calculations, odd cells, and company 'secret' formulas.

What is a Spreadsheet?

For those of you who did not see the changes brought by early personal computer spreadsheet software packages like VisiCalc and Lotus 1-2-3 back in the 1980s, they probably sold more personal computers (who needs more than 640K of memory?) to businesses than any other application. Accountants used to use real 'sheets' of paper 'spread' across their desks. Summing a column by hand or with a 10 key calculator and then writing down the results all day long was a common task (or career). Wikipedia describes a spreadsheet as a computer application for organization, analysis and storage of data in tabular form. Spreadsheets were developed as computerized analogs of paper accounting worksheets. The program operates on data entered in cells of a table. Each cell may contain either numeric or text data, or the results of formulas that automatically calculate and display a value based on the contents of other cells. A spreadsheet may also refer to one such electronic document.

Those spreadsheet power users who could program formulas, pivot data, and perform other tricks were very important persons. However, there was no way for two or more people to share the same spreadsheet, at least not if you wanted to ensure the integrity of the data or programmed formulas. No transactions, locking, or anything like that (and to be honest not many of the databases back then had those features either). But the computer did sums, standard deviations, calculated depreciation, net present value, and other vital calculations, not a human.

Converting a Spreadsheet to a Database

The limitations of a spreadsheet can choke the growth of a small business, and so you can hear pleas for conversion help on a regular basis.

I am working on a new project to document how to do such a conversion. Right now I am still looking for a good source of data (please share if you have a good one) in Excel format (or similar) to show how to convert the data, how to use the Import Wizard in MySQL Workbench (and the utilities in mysqlsh too), common problems in import, checking the data for completeness, setting up structure, some normalizing, and then some query optimization.

One example of a typical problem found in a spreadsheet to database conversion is the need to scan the columns for consistency. If a column is supposed to be yes/no, y/n, true/false, or some binary 0/1 variation, it is almost inevitable to find an asterisk, an 'm' (for 'maybe'), or the 'See John for details on this'. Cells in a spreadsheet are much more flexible than typed columns in a relational database.
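A scan like that is easy to sketch in Python. The set of accepted spellings and the function name unexpected_values are assumptions made for illustration; adjust the set to whatever your spreadsheet was supposed to contain.

```python
from collections import Counter

# Hypothetical set of spellings accepted for a yes/no column.
ALLOWED = {"yes", "no", "y", "n", "true", "false", "0", "1"}

def unexpected_values(values, allowed=ALLOWED):
    """Tally the cell values that are not an accepted yes/no spelling."""
    strays = Counter()
    for value in values:
        cleaned = value.strip().lower()   # tolerate case and padding
        if cleaned not in allowed:
            strays[value] += 1
    return strays

column = ["Y", "n", "yes", "*", "m", "See John for details on this", "0", "m"]
print(unexpected_values(column))
```

Empty cells are reported as strays too, which is usually what you want to know before deciding whether the database column can be NOT NULL.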

I will be posting over the next little while the steps I am taking with this project. Hopefully it can be the first step in a series. Please let me know if you have input.

CSV or not to CSV

Comma Separated Values (CSV) files are a common way of outputting data from a spreadsheet to another spreadsheet or another platform. If you use Excel, your version may offer three versions of CSV to pick from. Then you might need to demarcate the end of each line, the start & end of each cell, and other similar encodings.
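To make those choices concrete, here is a small Python sketch that writes the same two rows in two CSV flavors. The helper to_csv and the sample rows are made up for illustration; the options themselves (delimiter, quoting, lineterminator) are standard csv module parameters.

```python
import csv
import io

rows = [["road_name", "description"],
        ["SH-20", 'Right lane closure, "on-going"']]

def to_csv(rows, **format_options):
    """Serialize rows to CSV text using the given csv.writer options."""
    buffer = io.StringIO()
    csv.writer(buffer, **format_options).writerows(rows)
    return buffer.getvalue()

# Defaults: comma delimiter, quotes only where needed, \r\n line endings.
default_text = to_csv(rows)

# Same data with semicolons, every cell quoted, and Unix line endings.
strict_text = to_csv(rows, delimiter=";", quoting=csv.QUOTE_ALL,
                     lineterminator="\n")

print(repr(default_text))
print(repr(strict_text))
```

Whatever flavor you export, the program doing the import has to be told the same delimiter and quoting rules, or the cells will not line up.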

JSON is another option but not for my version of Excel. Nor LibreOffice nor Google Sheets. And XML is not a favorite of mine.

So CSV becomes the least common form factor. (insert golf clap sound effect here)

CIA World Factbook

The CIA World Factbook is the basis of the famous MySQL world and world_x databases used in MySQL documentation, classes, and other literature. It is still free but a MySQL devotee would not recognize the majority of the data. It has evolved a lot and I am tinkering with using this data to show how to do the conversion, use the migration wizard, et cetera. And it is free.

Next Time

Next time we will look at the spreadsheet data and the CSV. And yes, as a preview of things to come, there are issues with the data. But that is for the next time.

Tuesday, March 2, 2021

Invisible MySQL?

Is MySQL going invisible? Invisible Indexes were included in MySQL 8.0 and now with version 8.0.23 we have Invisible Columns.

Indexes You Can Not See!

The value of the invisible index is that it allows you to make an index disappear from the view of the optimizer. In the distant days before 8.0, you would often delete an index you were pretty much definitively positive nobody or no query was using. And then you would find out that yes, not only was that index you just deleted necessary to everyone in the galaxy (but maybe you) but it was going to take some serious clock time to rebuild that index.

But with Invisible Indexes, you issue a command like ALTER TABLE t1 ALTER INDEX i_idx INVISIBLE; and the index is removed from use by the optimizer. Now you can run EXPLAIN on your queries and compare results. And if you want that index back among the visible, ALTER TABLE t1 ALTER INDEX i_idx VISIBLE; returns you to full functionality.

I do recommend making an index invisible as part of a process of decommissioning an index, similar to a soft delete of a column to avoid hurried recreation. And the Sys Schema will show you indexes that have not been used, just make sure you have long enough of a time period to let those queries that only run once a week/month/quarter or longer show themselves.

Columns You Can Not See

MySQL 8.0.23 now allows you to have columns you can sort of not see. They are not really invisible or obfuscated but those columns are harder to see. If we create a table with an invisible column we have to explicitly call out that column to see the values, as a * wildcard will not return the value.

SQL > create table stuff (id serial, c1 int, c2 int invisible);
Query OK, 0 rows affected (0.0393 sec)
SQL > insert into stuff (c1, c2) values row(1,2), row(3,4), row(5,6);
Query OK, 3 rows affected (0.0073 sec)
Records: 3 Duplicates: 0 Warnings: 0
SQL > select * from stuff; <- the wildcard, no invisible column
+----+----+
| id | c1 |
+----+----+
|  1 |  1 |
|  2 |  3 |
|  3 |  5 |
+----+----+
3 rows in set (0.0005 sec)
SQL > select id,c1,c2 from stuff; <- have to call c2 to see c2
+----+----+----+
| id | c1 | c2 |
+----+----+----+
|  1 |  1 |  2 |
|  2 |  3 |  4 |
|  3 |  5 |  6 |
+----+----+----+
3 rows in set (0.0005 sec)

So you could somewhat hide a column by using this method but please do not call this secure. This will allow you to add columns, say a primary key to a table lacking one, without having to worry about needing to modify existing queries.

And if you are creating a table from a table with an invisible column you need to explicitly reference it or you will not get that data.

Limitations?

The first limitation is that not all columns can be invisible, so at least one column needs to be visible. Invisible columns can be defined as NOT NULL and have DEFAULT values.

What Else Is Going To Be Invisible?!

Well, in my job as a MySQL Community Manager, I do on rare occasion have access to software pre launch and I can tell you there are some really cool things in the product pipeline but I can not let you see them yet. (sorry, very bad joke)