Storing database records that lack a unique ID

posted: March 10, 2018

tl;dr: The unique ID is to a database record as the soul is to a person...

It is all to easy to take unique IDs, those somewhat long strings of digits and occasionally letters which identify a record in a database, for granted. When programming a system from scratch it is pretty easy to create and use unique IDs. You typically know when a new record is being created, such as when a user fills in a form and clicks “submit”, and there are several techniques for generating the unique ID itself. If using sequential numeric IDs the database can be queried for the current maximum ID value, which can then be incremented. Another technique is to generate a Universally Unique ID (UUID), and there are libraries in many programming languages to easily do so.

Once generated, the unique ID becomes the “key” to the record, in multiple senses of the word. The key is metadata that allows a new copy of the record to be stored with new data in any or all other fields: the old copy of the record is found by the unique ID key and is replaced with the new copy of the record. Records can easily be linked or grouped to other records by referencing the unique ID keys. Keys can be indexed for fast record retrieval by unique ID. When databases need to be combined into a database-of-databases, it’s easy to deal with the potentially overlapping unique IDs by just prefixing the unique ID in the database-of-databases with some additional characters that identify the database from which the record (and the unique ID) came.

Life is good when records have a unique ID

I pretty much took all this for granted until a recent project which required loading records about people which lacked any sort of unique ID. Actually the records contained what appeared to be a unique ID, but it turned out that the same person could have different IDs across multiple records. Perhaps the original system which generated the IDs generated new IDs for the same person at different points in times for different transactions or time periods, or perhaps the IDs came from multiple downstream databases, or perhaps it was just bad programming or database management; it was unclear what caused the issue. The bottom line was that the IDs could not be trusted, and the historical data could not be fixed. One of the developers I work with said, half-jokingly but also insightfully, that we should walk away from the project.

Some people misunderstand this problem and think it is merely a matter of generating a sequential number or a UUID and adding it to the incoming record. Yes, it is trivially simple to generate an ID number, but the challenge is knowing when to generate it. Many of the incoming records will be new records for the same person, not new persons. If every incoming record is assigned a new ID, then there will be many duplicate records for the same person. The challenge is determining when an incoming record is a new version of a person’s record that already exists in the database and when it is a brand new record for a person that is not in the database already. In the former case the existing record (with the existing unique ID) needs to be located and updated; in the latter case a new unique ID needs to be generated and a new record stored.

There’s no easy solution to this problem, and no perfect automated solution to this problem, unless there is a set of data fields that, when combined, is entirely unique (there can only be one record that contains that set of values) and consistent (every record that comes into the system for that entity has the same set of values).

Consider the challenge of records about people, which contain biographical data such as name, birthdate, gender, home address, phone numbers, email address, height, weight, etc. It is possible for pretty much any field of data about a person to change in the future. People’s names often change when they get married or divorced. Some people change their first and/or last names to adapt to the style of a particular culture. People move all the time, and get new phone numbers, and switch email addresses. Birthdates can change, if people decide to fudge in the future on their birthdate or if they correct an error in the previously stored value. People change their gender; the male Bruce Jenner became the female Caitlyn Marie Jenner. Just as with the human body itself, where almost any organ or limb can be removed or replaced, almost any piece of data about a person can be changed; the soul is the one invariant aspect of a person. In that sense, the unique ID in a database is the soul of a person’s record.

I think the best that can be done is to choose a set of data fields that are nearly unique and relatively invariant and use those fields to look up every incoming record to see if already exists in the database. Note: this lookup process takes a lot more time than looking up a record based on a unique ID, because it involves querying the database for records which match certain field values and then analyzing the results. If an existing record with those field values can be found, assume that the incoming record refers to the same entity and use the unique ID already assigned; if it can’t be found, generate a new unique ID and store it as a new record. One more step is desirable: provide a way for the user of the system to manually merge records, if that user can utilize some additional knowledge to determine that two records in the database actually refer to the same entity.

The exact set of fields chosen to use in this process will depend upon the fields which are available, how often they actually are populated with data, how accurate that data is, and how invariant the data is. Some fields are tricky: addresses may be entered in different formats across multiple records (e.g. “123 S. Main St.” versus “123 South Main Street”); names may appear one time with a formal name and the next time with a nickname or synonym (e.g. “Katherine” versus “Kathy”). Fail to take this into account and you will end up with more duplicate records than are desirable, which increases the burden on the user to manually merge records.

If you are faced with the challenge of loading records which lack a unique ID, you are in for quite a bit of data analysis, database querying, parsing, tokenizing, optimization, and user interface work. Good luck!

Related post: Sequential IDs vs. UUIDs

Related post: Aliases vs. IDs