Count the data

posted: August 22, 2020

tl;dr: How the simple act of counting records solved a subtle data quality issue...

I can’t remember not knowing how to count. It’s something Count von Count teaches preschoolers on Sesame Street. The Count loves to count, and I seem to share his affinity for doing so. In particular, I like to count records as a means of ensuring data quality.

I’m always asking myself: what is the count of these records and does that count make sense? Sometimes you know exactly what the record count should be: there are 7 days in a week and 12 months in a year, so if you do a count of distinct days and months and don’t get these numbers, then something almost certainly is amiss. Other times you only know a ballpark figure: Wikipedia says the population of Arizona was 7,278,717 in 2019, so if you’re given a database table purporting to contain one record for every state resident and the count is not close to this figure, something is amiss. Other times you only know a relationship between counts: unless there is a gender bias in the data, the count of males and females for a given population should be close to 50/50.

Counting, a skill that preschoolers learned, recently helped me find a very subtle bug: a missing state from a list of U.S. states and territories in a dropdown menu. Here’s how it unfolded.

Is this partial list of U.S. states and territories correct?

I had the task of producing a list of enumerated states and territories for a dropdown menu in a new website form. Not wanting to reinvent the wheel, I searched our code repositories for an existing form that was already tested and in production, which had an enumerated, alphabetized list of states and territories. I sent a copy of the list off to the business team to see if they wanted any changes made to the list. They did: they wanted to remove some of the more far flung U.S. territories. They sent me back the territories to remove.

Fortunately, the original list fortunately was organized as a JSON array with one state or territory per line. I popped it into Visual Studio Code, which displays line numbers by default. I removed the unwanted territories from the list. Following my own advice (and habit) to look at the data, I then spent a little time doing so.

I noticed that most of the territories had been removed, leaving the District of Columbia (D.C.) and Puerto Rico (P.R.) as the only non-states in the list. It was pretty easy to see this by quickly scrolling through the list. Now my counting affinity kicked in. There should be a total of 52 items in the list: 50 states plus D.C. and P.R. Line 1 was the opening square bracket [. Line 2 was “Alabama”. Scrolling to the bottom, I saw that “Wyoming” appeared on line 52.

Now my brain started playing tricks on me. The end of the list was on line 52, so maybe everything was good; but because of the square bracket on line 1, there weren’t 52 data elements in the list. The first data element was on line 2, and the last was on line 52. That would seem to indicate that there were 50 items in the list, which of course is the number of states, so maybe everything was good; but I knew there were two non-states, and also that you had to subtract line numbers and add one to determine an inclusive count of the total number of items. There were 51 items in the list, even though there should be 52 and the last was indeed on line 52. I had to overcome the tricks my mind was playing to realize that there was a problem.

But what was the problem and where did it come from? What was missing? It must be a state, because I could clearly see the District of Columbia and Puerto Rico in the list. I started examining the states in detail, but I couldn’t see the missing one; it wasn’t something obvious like “Arizona”, “California”, or “New York”.

My next step was to go to that wonderful repository of collective knowledge, the Internet, and find an alphabetized list of the 50 U.S. states. That was an easy search, and I cut-and-pasted the list into a new VS Code tab, with a blank line at line 1. I then temporarily deleted District of Columbia and Puerto Rico from my list, so that I could compare my list side-by-side with the list from the Internet. Then it was easy to find the missing state.

Had I mistakenly deleted the missing state when creating my list, or did the problem go further back in time? I examined my original source, the list of states and territories from a production repo: it was missing that state. The missing state was such a hard problem to find visually that no one had noticed it either in previous test cycles or production. No one had complained about it.

It was an easy fix, once the problem had been found. Record counting was the key to finding the problem. This is only the most recent example in my career of how performing a simple count has helped improve data quality.