
Friday, August 2, 2019

Parallel Table Importer in MySQL Shell

The ability to import data in parallel is now a feature of the new MySQL Shell, and you can run it from the command line as well as from within an interactive shell.

How Does It Work?


The importTable utility, new in MySQL Shell 8.0.17, analyzes the input file, divides that file into chunks, and then uploads the data over parallel connections. This is much faster than a single LOAD DATA statement. You can define how the data is delimited, both for fields and for lines. It works with DOS CSVs, Unix CSVs, TSVs, and JSON (as long as the JSON is in one-document-per-line format). You can also adjust the number of threads, the number of bytes sent in each chunk, and the maximum rate of data transfer per thread, so you can balance the load on the network against the speed of the transfer.

The parallel table import utility works only over the classic MySQL protocol and not the newer X Protocol, as X Protocol connections do not support LOAD DATA statements. The utility uses LOAD DATA LOCAL INFILE statements to upload chunks of data from the input file. Make sure the data file you want to import is in a location that is accessible to the client host as a local disk, and that the local_infile system variable is set to ON on the target server.
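A quick way to check that setting on the target server, assuming an account with the privileges to change global variables:

SHOW GLOBAL VARIABLES LIKE 'local_infile';

-- Enable it if it comes back OFF
SET GLOBAL local_infile = ON;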

What to Specify


Besides the name of your data file, you can specify the schema, the table name, an array of column names to map to the table (for cases where your data file does not have all of the table's columns), whether input rows that duplicate an existing unique key value replace the existing rows, the line terminator, the field terminator, what fields are enclosed by, the number of threads for uploading, the bytes per chunk, the maximum I/O rate, and the dialect of your data file (CSV, TSV, or JSON). And you can get a status report on the progress as the import runs.
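For example, here is a minimal sketch of an interactive call in the shell's JavaScript mode. The file name, column list, and tuning values below are placeholders, so adjust them to your own data:

util.importTable("/data/foo.csv", {
    schema: "test",
    table: "foo",
    columns: ["id", "name", "amount"],  // map file fields to table columns (placeholder names)
    dialect: "csv-unix",                // or "csv", "tsv", "json"
    skipRows: 1,                        // skip a header row if the file has one
    threads: 8,                         // number of parallel connections
    bytesPerChunk: "50M",               // how much data each chunk carries
    maxRate: "5M",                      // cap on I/O per thread per second
    showProgress: true                  // report progress while the import runs
});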

From the Command Line Too


You can use util.importTable from within an interactive shell or from the command line.  The following is an import of ten million rows run from the command line.

$ mysqlsh mysql://root@localhost --ssl-mode=DISABLED -- util import-table foo.csv --schema=test --table=foo
Importing from file 'foo.csv' to table `test`.`foo` in MySQL Server at /var/lib/mysql/mysql.sock using 3 threads
[Worker000] test.foo: Records: 1496823  Deleted: 0  Skipped: 0  Warnings: 0
[Worker001] test.foo: Records: 4204841  Deleted: 0  Skipped: 0  Warnings: 0
[Worker002] test.foo: Records: 4298336  Deleted: 0  Skipped: 0  Warnings: 0
100% (117.80 MB / 117.80 MB), 490.81 KB/s
File 'foo.csv' (117.80 MB) was imported in 2 min 25.9817 sec at 806.94 KB/s
Total rows affected in test.foo: Records: 10000000  Deleted: 0  Skipped: 0  Warnings: 0

For details and more examples please see the MySQL Shell Manual.

Wednesday, September 5, 2018

Loading Unknown CSV Data into a Table Easily

Peter Zaitsev of Percona posted a tweet about an interactive book, Select Star SQL, which 'aims to be the best place on the internet for learning SQL'.  The book uses a data set that you can download, or you can work through the queries online.  I showed the website to a neighbor who had been asking 'what is the best way to learn SQL?'.  The neighbor wanted their own copy of the data and asked how you turn a raw CSV file into a table.

Well, that can get complicated.

CSV


CSV files use commas to separate the fields in each record.  This works well until someone plants a comma in the middle of a field, but it is a least-common-denominator way of passing data between systems.

But a CSV file does not directly fit into a relational table.  You could shove all the data into a JSON column, but that is not what we want.  What is needed is a peek at the data in the CSV file to see what the various fields are and to get hints at the types of data therein.  So download the data and please follow along.

Traditionally it took a bit of data sleuthing, even before you fired up your database client, to examine the CSV file and determine how the data needed to be handled to fit into a relational table.

If you open the file with a spreadsheet program like LibreOffice Calc, it will usually make a best effort at putting each field in its own column.  Luckily this data set's first row names the columns.  Here are the column headings:

Execution
Date of Birth
Date of Offence
Highest Education Level
Last Name
First Name
TDCJ Number
Age at Execution
Date Received
Execution Date
Race
County
Eye Color
Weight
Height
Native County
Native State
Last Statement

Looking at the first few rows under the headings, we can get a good glimpse at the data types for the columns.  What we want to do on the first pass is to see if the values are strings (CHAR), integers, reals, and the like.  It also helps to look for a field that can be used as a primary key, hopefully a column with unique values.

Execution  INT Primary key??  Unique??
Date of Birth DATE
Date of Offence DATE
Highest Education Level INT
Last Name CHAR
First Name CHAR
TDCJ Number INT
Age at Execution INT
Date Received   DATE
Execution Date DATE
Race CHAR
County CHAR
Eye Color CHAR
Weight INT
Height CHAR
Native County  CHAR
Native State CHAR
Last Statement  LONG TEXT

Second Pass


In this case it looks like we could use Execution as the primary key.  Now we need to go through the data to see which fields need special handling.  Are the possible INT fields all positive values (unsigned)?  Could we use ENUMs anywhere (maybe Native County, Native State, or Eye Color)?

Also, for metric fans, the Height column uses imperial measurements, which may not fit a company policy of metric only (or cubits). Plus Height has two components -- feet and inches -- that we may need to break down later, or we may want to record all of this data in inches. We will not cover that in this blog post (but we could cover it if enough folks ask how to do this).

Also we need to examine each of the CHAR fields to check their maximum lengths.  This can be dangerous for future additions, where suddenly the maximum length set by our first pass at the data is swamped by some much longer field.  If the only last names we see are Smith and Jones, setting the maximum length of the CHAR field works until we run into someone named Buchanan or Schwarzenegger.  But since this is a closed system (no updates to the data), we will not worry about that, at least for this data set.

So, making a best guess, let's assign lengths to the CHAR fields.

Execution  INT Primary key??  Unique??
Date of Birth DATE
Date of Offence DATE
Highest Education Level INT  unsigned
Last Name CHAR  25
First Name CHAR  25
TDCJ Number INT
Age at Execution INT
Date Received   DATE
Execution Date DATE
Race CHAR  10
County CHAR 20
Eye Color CHAR 10
Weight INT
Height CHAR 10
Native County  CHAR 25
Native State CHAR 25
Last Statement  LONG TEXT

Please note that in the actual table I will be using VARCHAR instead of CHAR; CHAR is used above just for clarity.

Create Our Table

We could of course create a table like the one below and then load the data file.
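Here is a sketch of what that table could look like, using the types and lengths guessed above and keeping the CSV header names as column names (backticked because they contain spaces).  The LOAD DATA statement at the end is the non-Workbench way to pull in the file; the file name and line terminator are assumptions about your downloaded copy.

CREATE TABLE tx_deathrow (
    `Execution` INT NOT NULL,
    `Date of Birth` DATE,
    `Date of Offence` DATE,
    `Highest Education Level` INT UNSIGNED,
    `Last Name` VARCHAR(25),
    `First Name` VARCHAR(25),
    `TDCJ Number` INT,
    `Age at Execution` INT,
    `Date Received` DATE,
    `Execution Date` DATE,
    `Race` VARCHAR(10),
    `County` VARCHAR(20),
    `Eye Color` VARCHAR(10),
    `Weight` INT,
    `Height` VARCHAR(10),
    `Native County` VARCHAR(25),
    `Native State` VARCHAR(25),
    `Last Statement` LONGTEXT,
    PRIMARY KEY (`Execution`)
);

-- Assumes the downloaded file is named tx_deathrow.csv and sits in the current directory
LOAD DATA LOCAL INFILE 'tx_deathrow.csv'
INTO TABLE tx_deathrow
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row

If the dates in the file are not already in MySQL's YYYY-MM-DD format, they would need to pass through user variables and STR_TO_DATE() on the way in, which is part of why the Workbench wizard below is the easier route.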



MySQL Workbench makes quick work of creating the schema.  But Workbench also has a wizard that will open the CSV, let us name the columns, and import the data.

The Easy Way To Go From CSV to Relational Table

MySQL Workbench's import wizard is reached by clicking on a table icon and selecting Table Data Import Wizard.

Tell the import wizard where the CSV file is located and start the import.  The wizard splits the CSV up by fields, gives us a chance to edit the field types, and then loads the data.

Now, one big warning: the column names are copied straight from the first row of the CSV.  We could have changed Date of Birth to something like Birth_date but chose not to.  This means we have to write queries like SELECT `Date of Birth` FROM tx_deathrow, with backticks around any column name that contains spaces.
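The backticked names work fine in queries, and if they get tiresome a one-time rename is an option (Birth_date below is just the alternative name suggested above):

SELECT `First Name`, `Last Name`, `Date of Birth` FROM tx_deathrow LIMIT 5;

-- Optional: rename the column once so later queries can skip the backticks
ALTER TABLE tx_deathrow CHANGE COLUMN `Date of Birth` Birth_date DATE;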

But now we have a local copy of the data to use when working through the interactive book.