Look at the data

posted: June 22, 2019

tl;dr: The first and last step in any data programming task is to look at the data...

Before starting on any programming task involving data, I always spend a considerable amount of time looking at the data. I won’t write a line of code until I’ve thoroughly analyzed, by visual inspection, the data that I am dealing with. Most ETL (Extract, Transform, and Load) data jobs, outside the realm of data science, tend not to require sophisticated programming techniques. Rather, the main qualities required of the programmer are fastidiousness, attention to detail, and tenacity. A data job isn’t truly complete until all the corner cases are properly handled.

If the data is coming from a database, I’ll connect to the database and look at the tables, columns, and rows by writing some queries. If it is from a spreadsheet, I’ll fire up the spreadsheet program; if it is from a CSV file, I’ll open it in a CSV file viewer. If the data is coming directly from user input, such as from a webform or a user interface, I’ll look at some prior user entries, if they exist.
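As a rough illustration of that first look at a database, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the customers table, and the peek helper are all made up for this example; a real job would connect to whatever database the data actually lives in.

```python
import sqlite3

def peek(conn, limit=5):
    """Return {table: (column_names, first_rows)} for every table."""
    overview = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        cur = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        columns = [desc[0] for desc in cur.description]
        overview[table] = (columns, cur.fetchall())
    return overview

# Hypothetical sample data, just so there is something to look at.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "2019-01-05"), (2, "BOB", None), (2, "BOB", None)],
)

overview = peek(conn)
for table, (columns, rows) in overview.items():
    print(table, columns)
    for row in rows:
        print("  ", row)
```

Even three rows are enough here to spot a repeated ID, a missing date, and inconsistent capitalization.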

Over the years I’ve learned to look for many potential pitfalls in the data. Some of the questions I attempt to answer are:

How are the records organized or sorted? Chronologically, by unique ID, or in some other way?
Do there appear to be gaps in the records that may indicate some data is missing? This often happens when the data is in more than one file.
Do the unique IDs truly appear to be unique?
Are there duplicate records? Hint: there nearly always are.
Is the data well-populated or sparse?
Which fields/columns always have a value? Which do not?
Are there occasional NULL values in fields that should always have a value?
Do the values in the fields appear to fall within expected, valid ranges? Or are there living people who are 200+ years old, or other values that are well outside the realm of the possible?
For string fields, is the text nice and clean or are there occasional garbage values or misspellings?
Does the text sometimes have additional whitespace characters in it?
Is the text consistently formatted and capitalized in the same manner across all records?
Is there a need to transform the case of the text, for example converting ALL CAPS to something that will look better when the data is loaded into the new system?
For enumerated fields, are the values always one of the possible enumerated values, or are there some strange values on occasion?
For string fields used as booleans, such as ‘Y’ and ‘N’, do all the records have one of those two values or are there occasionally other values?
For string fields, do all the characters fall within the ASCII character set, or are there non-ASCII characters?
Does the encoding of the data appear to be correct, or are there occasionally some weird-looking characters that may indicate an encoding mismatch?
For datetime and time values, are they in UTC or do they carry an explicit timezone? If neither, what timezone do the times appear to be from?
Do some of the records have values that are misplaced, by being in the wrong field/column?
In CSV files, how are strings that contain commas quoted or escaped? This may require firing up a text editor.
If the data is in more than one file, do all the files have exactly the same column headers, or are there inconsistencies across the set of files?
If the data is supposed to be an incremental update of records whose values have changed, does that set of incremental records appear to be complete? Or is it too voluminous, including records whose values haven’t actually changed?
Do fields that are supposed to match up across files or tables actually match up? Can the expected relationships actually be formed?

Table Tool is a decent CSV file viewer for macOS

This assessment tells me what problems exist in the data. Some of these may need to be corrected at the source, and others can be cleaned up in code.
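Looking with your own eyes comes first, but once a suspicion forms, a throwaway script can confirm how widespread a problem is. The sketch below counts a handful of the issues from the list above (duplicate IDs, empty fields, stray whitespace, non-ASCII text, unexpected enumerated values). The function name profile_rows, the sample CSV, and the allowed-values mapping are all assumptions made up for illustration.

```python
import csv
import io
from collections import Counter

def profile_rows(rows, id_field, enum_fields=None):
    """Count simple data-quality problems in an iterable of dict rows."""
    enum_fields = enum_fields or {}
    stats = Counter()
    seen_ids = set()
    for row in rows:
        stats["rows"] += 1
        row_id = row.get(id_field)
        if row_id in seen_ids:
            stats["duplicate_ids"] += 1
        seen_ids.add(row_id)
        for field, value in row.items():
            if field is None:
                # csv.DictReader stores surplus columns under the key None.
                stats["extra_columns"] += 1
                continue
            if value is None or value == "":
                stats[f"empty:{field}"] += 1
                continue
            if value != value.strip():
                stats[f"extra_whitespace:{field}"] += 1
            if not value.isascii():
                stats[f"non_ascii:{field}"] += 1
            allowed = enum_fields.get(field)
            if allowed is not None and value.strip() not in allowed:
                stats[f"unexpected_value:{field}"] += 1
    return stats

# Hypothetical sample file contents.
SAMPLE = """id,name,active
1,Alice,Y
2, Bob ,N
2,Zoë,y
3,,Y
"""

stats = profile_rows(
    csv.DictReader(io.StringIO(SAMPLE)), "id", {"active": {"Y", "N"}}
)
for key, count in sorted(stats.items()):
    print(key, count)
```

The counts only say how often something happens. Deciding whether a non-ASCII name or a lowercase 'y' is actually a problem still requires looking at the records themselves.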

Then, still before I write a line of code, I’ll run some records through the desired “transform” algorithm in my head to see if the results look good. Sometimes the transform algorithm doesn’t work when applied to the real-world data and will need to be modified. If you jump too quickly into coding an algorithm that doesn’t actually work with the given data, you’ll just end up with a large number of bad results and wasted coding effort.

Finally, after determining that the algorithm might actually work, and figuring out how I’ll need to clean up the data, I’ll write the code. This is the fast, easy, and fun part. When I write the code, I’ll put in a bunch of counters to keep track of how many records fall into certain categories, as well as some print/log statements for certain rare cases to see if they actually occur. These counters and log statements are one way to know that the code is actually working as expected. They can also be used for monitoring the job during future runs.
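Here is a minimal sketch of what such counters and log statements might look like, using collections.Counter and the standard logging module. The record layout, the category names, and the 0 to 120 age range are assumptions chosen for the example, not part of any particular job.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_job")

def transform(records):
    """Clean records, counting how many fall into each category."""
    counts = Counter()
    output = []
    for rec in records:
        counts["read"] += 1
        name = (rec.get("name") or "").strip()
        if not name:
            counts["skipped_missing_name"] += 1
            continue
        age = rec.get("age")
        if age is not None and not 0 <= age <= 120:
            # Rare case: log it so we can see whether it really occurs.
            counts["age_out_of_range"] += 1
            log.warning("Implausible age %r in record %r", age, rec.get("id"))
            age = None
        output.append({"id": rec.get("id"), "name": name.title(), "age": age})
        counts["written"] += 1
    log.info("Job counts: %s", dict(counts))
    return output, counts

# Hypothetical input records.
records = [
    {"id": 1, "name": "ALICE SMITH", "age": 34},
    {"id": 2, "name": "  ", "age": 51},
    {"id": 3, "name": "bob jones", "age": 212},
]
output, counts = transform(records)
```

A useful sanity check is that the counters add up: every record read should be accounted for as either written or skipped.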

I’ll start by running a small number of records through my code and closely inspecting the results. Only when I have confidence that a small set of resulting records looks good will I increase the sample size. I’ll typically bump it up in steps of an order of magnitude or two before I launch a job that processes all the records.
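One way to sketch this stepping-up approach is with itertools.islice, which takes only the first N records from a source. The helper name run_in_steps, the step sizes, and the toy record source and process function below are hypothetical stand-ins for a real job.

```python
from itertools import islice

def run_in_steps(make_records, process, sizes=(10, 1_000, 100_000)):
    """Process progressively larger samples, yielding each batch of results."""
    for size in sizes:
        sample = islice(make_records(), size)
        results = [process(rec) for rec in sample]
        yield size, results

# Hypothetical record source and transform.
def make_records():
    return ({"id": i, "name": f"user {i}"} for i in range(250))

def process(rec):
    return {**rec, "name": rec["name"].title()}

for size, results in run_in_steps(make_records, process, sizes=(10, 100, 1_000)):
    # In practice, stop here and inspect the results before continuing.
    print(f"requested {size}, processed {len(results)}, first: {results[0]}")
```

Because run_in_steps is a generator, nothing larger runs until the caller asks for the next step, which leaves room to stop and inspect each batch.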

One other test that I like to run, especially when dealing with data jobs that take input from users, is what I call the emoji test 😀. Unicode encoding issues may exist somewhere across the multiple systems in the data path. To surface these I’ll put in some records with emoji characters in the text fields and see if they make it all the way through.
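The real emoji test pushes a record through every system in the data path, which a short snippet can't reproduce. As a small stand-in, this sketch checks whether a string containing an emoji survives an encode/decode round trip in a few common encodings. The function name and sample text are made up for illustration.

```python
def survives_round_trip(text, encoding):
    """Return True if text encodes and decodes unchanged in this encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

EMOJI_SAMPLE = "Great service 😀"

for encoding in ("utf-8", "utf-16", "latin-1", "ascii"):
    print(encoding, survives_round_trip(EMOJI_SAMPLE, encoding))

# The emoji needs four bytes in UTF-8, which is exactly what trips up
# systems that only expect shorter sequences.
print(len("😀".encode("utf-8")))
```

The two Unicode encodings pass and the two legacy ones fail, which is the kind of difference the emoji test is meant to surface when it happens somewhere between systems.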

This methodology has worked for me. The most important step, both at the beginning and end of writing any data job, is to look closely at the data.

Related post: Play with the data

Related post: Count the data

Related post: De-duplicating database records