Data quality matters

posted: June 15, 2018

tl;dr: The old “garbage in, garbage out” maxim has never been more true than in this era of Big Data...

In my role at Uprising Technology I see a lot of data from many sources: clients, third parties, the Web, and the Internet. Before we can determine whether a new source of data can deliver the benefits the client is seeking, I am fond of saying “step one is we need to look at the data”. Data quality is highly variable, and poor quality data can lead to many undesirable outcomes.

To cite a non-Uprising example: the high school my kids attended had an online grading system that parents could log into and, in theory, see a real-time snapshot of their kids’ grades at any point during the semester, instead of waiting for report cards to be issued. The promised benefit sounded great: parents could get early indications of their students struggling, and could help teachers correct student behavior. In reality the system was a disaster due to erroneous data. It fundamentally relied upon the teachers entering accurate grades in real-time, which is the exact opposite of the way that most teachers actually do grading: they tend to procrastinate and not do it until the moment final grades are due.

I’d log into the grading system a month into the semester and see that my child had incompletes in two subjects, was failing three others, had Ds in two, and only one A (which was usually music, which didn’t really have any assignments or tests). The incompletes were due to the teacher not entering any data at all about possible or actual point values for any assignments or tests. The Fs and Ds were usually where the teacher had entered the possible point values for some graded items (e.g. “Chapter 1 worksheet is worth 10 points”, “First test is worth 50 points”) but hadn’t entered the points my student had earned, so the system assumed “0”. I’d ask my student about the F or D, and almost always the response was “I did the assignment (or test) but the teacher hasn’t entered the grades yet”. After several iterations of this I gave up and never logged into the system again. It’s a worthless IT investment, due to bad data and a human factors design that fundamentally is in conflict with the way that teachers actually behave.

I've seen things...(warning: spoiler alert)

When I actually looks at the data that a system relies upon, I often am reminded of Roy Batty’s final monologue from my ATF movie Blade Runner. I’ve seen plenty of things in data that people wouldn’t believe. I’ve seen people visiting a client institution a day ago who were born in the early 1800s, and others who have already attended an event in the 22nd century (so perhaps time travel and immortality have already been achieved and we just don’t realize this yet). I’ve seen lots of people with the name “NULL”. I haven’t yet seen Little Bobby Tables, but that is probably only because Little Bobby Tables causes database tables to be wiped out.

To be fair, data is complex, and doesn’t often fit nicely into the two dimensional tables, with neatly structured rows and columns, that constitute so many databases, whether they be Excel files or SQL databases. Names, for example, are more challenging than most people realize.

People often have a formal first name, such as “Christopher”, and a less formal first name, such as “Chris”. Sometimes there are multiple informal first names, such as “Bella” and “Izzie” for “Isabella”. Sometimes the less formal first name is derived from the formal first name, as in the “Chris” example, and sometimes it is something generic and completely different, such as “Buddy” or “Sonny”. Some people don’t like their first name and hence will just enter an initial, or perhaps use their middle name or a nickname as their first name. Some people have one middle name, and some people have multiple.

Last names change often, due to marriage or divorce. Sometimes people will hyphenate multiple last names together, and sometimes they will list them separately, or occasionally provide only one. Some people come from cultures where the family name is stated first. Some people from other cultures rename themselves upon coming to a new country such as the United States, so they may have a legal name which is very different from the name that they use when filling out forms on a website.

With all these situations in play it’s no surprise that the FIRST_NAME, MIDDLE_NAME, and LAST_NAME columns in a database often contain incorrect or inconsistent or incomplete data. Even GENDER is a tricky one, as in our culture it is morphing into gender identity and it’s not necessarily binary. People can also change their gender, along with their name. Bruce Jenner, male, becoming Caitlyn Jenner, female, is a situation that presents a challenge to every system which has a database containing information on the former Bruce Jenner.

Duplicate records are another major challenge for databases, as it is usually all too easy for the users of a system to create multiple records representing the same thing, such as a person. At Uprising we built an Identity Resolver to use information about a person to match up a person’s record in one database with that same person in a completely different database. It works well, but an interesting situation cropped up when we first started using it. When attempting to use information about a person in one database to find that person in the second database, we’d occasionally get multiple results from the second database, because there were multiple records for the same person in the second database (and probably the first one as well). We realized that we could use the Identity Resolver to do database de-duplication: grab every record from a database, and find all the records in the same database with similar identity information.

There are ways to combat some of these data quality issues, but any system is going to have some amount of data entry errors. Values can be checked upon entry to make sure they are within proper ranges. Users can be given enumerated values to choose from, rather than entering text freeform. Addresses, emails, and phone numbers can be standardized and checked against other databases to see if they are valid. But even with the best possible data quality enforcement mechanisms, achieving 100% data perfection is not going to happen, as the world is more complex than the model built from it and stored in a database.