posted: September 26, 2018
tl;dr: Idempotence can be the difference between software that might work once and software that works all the time...
Idempotence: the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application. (Definition from Wikipedia.)
Many in the software world have forgotten or never learned the term “idempotence”, yet it is important for writing robust software that processes and stores data, particularly Extract-Transform-Load (ETL) database jobs. A software program with idempotence (a.k.a. an idempotent software program) will produce the same output and the same final state regardless of how many times it is run, given the same input data. When loading records into a database, if the same set of records is loaded once, or twice, or a hundred times, the final state of the database should be the same, ignoring metadata such as the exact timestamp at which each record was written to the database. There shouldn’t be two copies of each record if the job is run twice, or one hundred copies if it is run one hundred times.
It’s all too easy to write code that is not idempotent. Almost always the quickest-and-dirtiest way of writing code to accomplish a given task is to completely ignore idempotence and possible error scenarios, and just write code that assumes the job will run successfully from start to finish the first time, without encountering any sort of error either in the code or in the system as a whole. That is not, however, how software behaves in the real world.
In the real world, all too many problems can occur which cause software programs to fail. Databases can go down, briefly or for an extended period of time. Responses to queries might never be produced, or they might have errors. Network connections between servers and databases might be interrupted. Disks can fill up, halting jobs from completing. Jobs are sometimes manually killed by operators for various reasons including freeing up resources for other higher priority jobs. And, of course, the source code might not be perfect: there might be latent bugs that appear for the first time after a job has been working for quite some time in production, triggered by unexpected data or other unforeseen, untested conditions.
When a software program fails and the failure is noticed, the natural response is to try to run it again with the same data. Idempotent software is easy to run again from the beginning, as the data already processed before the failure will be processed again with no side effects. Rerunning software that is not idempotent is much more difficult; often you’ll have to figure out exactly where the job failed and how to restart the job from that point.
Making idempotent software which writes results to a database is tricky. The challenge is to avoid getting duplicate records from running a job more than once. It may not be possible to use a freshly generated Universally Unique Identifier (UUID) as each record’s ID: if the same record is loaded a second time, a different UUID will be created, and the same data will end up in the database on two different records with two different IDs, which is a duplication. Storing database records that lack a unique ID isn’t easy either. Another challenge comes when writing to databases which have separate INSERT and UPDATE operations. The first time a record is stored it needs to be INSERTed, but if it is already present, it needs to be UPDATEd. It might be necessary to try reading a record first to see if it already exists, before deciding whether to INSERT or UPDATE it. Database technologies which natively support upsert (update or insert, as needed) operations make writing idempotent software easier.
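As a rough illustration of the two ideas above, here is a minimal sketch in Python using the standard library's sqlite3 module. It assumes the data has a natural key (an email address here) from which a repeatable ID can be derived, and that the bundled SQLite is version 3.24 or later, which is when the ON CONFLICT ... DO UPDATE upsert syntax was added. The customers table, the "crm" source name, the namespace, and the helper names are all made up for the example.

```python
import sqlite3
import uuid

# Hypothetical namespace for this example; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "etl.example.com")


def record_id(source, natural_key):
    """Derive the same ID on every run from fields that identify the record."""
    return str(uuid.uuid5(NAMESPACE, f"{source}:{natural_key}"))


def load(conn, source, rows):
    """Upsert rows; loading the same rows again leaves the table unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "id TEXT PRIMARY KEY, email TEXT NOT NULL, name TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back if anything raises
        conn.executemany(
            "INSERT INTO customers (id, email, name) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET "
            "email = excluded.email, name = excluded.name",
            [(record_id(source, r["email"]), r["email"], r["name"]) for r in rows],
        )


conn = sqlite3.connect(":memory:")
rows = [
    {"email": "a@example.com", "name": "Ada"},
    {"email": "b@example.com", "name": "Bob"},
]
for _ in range(3):  # run the same load three times
    load(conn, "crm", rows)

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_count)
```

Because the ID is computed from the record rather than generated at random, a rerun hits the existing primary key and updates in place instead of inserting a second copy. This only works when the data really does contain a stable identifying field; when it doesn't, some other way of recognizing a repeated record is needed.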
It might seem simple to make operations which read data idempotent, as that’s the way that most memory technologies such as Flash, Random Access Memory (RAM), and disks behave: once data is successfully written to them, that data can be read back many times, until the device finally fails. Yet there are some memory technologies that can only be read once, and some special registers which may change a counter value each time they are accessed.
Recently I had to deal with an SFTP site that would, for security reasons, only allow a file to be retrieved once; if the file was retrieved but then not successfully processed and stored, there was no way to repeat the operation. Since I didn’t control the SFTP site, the only thing I could do was to minimize the operations done in between retrieving the file and storing a local copy that could, if the job subsequently failed, be used to supply the input data again.
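The "save a local copy first" approach might look something like the sketch below. It is not the code from that job: the download callable stands in for whatever SFTP client is actually used, and fetch_once and the spool directory are hypothetical names. The idea is simply to do nothing between the one-shot retrieval and a durable write to disk, and to have reruns pick up the saved copy instead of going back to the server.

```python
import os
import tempfile
from pathlib import Path


def fetch_once(name, download, spool_dir):
    """Return a durable local copy of `name`, downloading only if none exists.

    `download` is a hypothetical callable that takes the file name and
    returns its bytes; it may only succeed once.
    """
    spool_dir = Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    final = spool_dir / name
    if final.exists():
        return final  # an earlier run already saved it; don't touch the server

    # Write to a temporary file, then rename, so a crash mid-write
    # never leaves a partial file under the final name.
    fd, tmp = tempfile.mkstemp(dir=spool_dir, prefix=name + ".", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(download(name))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, final)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return final


# Simulated one-shot server: a second download attempt raises.
served = set()


def one_shot_download(name):
    if name in served:
        raise PermissionError(f"{name} was already retrieved")
    served.add(name)
    return b"id,amount\n1,10\n"


with tempfile.TemporaryDirectory() as spool:
    first = fetch_once("orders.csv", one_shot_download, spool)
    second = fetch_once("orders.csv", one_shot_download, spool)  # rerun
    same_path = first == second
    content = second.read_bytes()
```

Parsing, validating, and loading then all work from the local copy, so a failure in any of those steps can be retried without needing the remote file again.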
When you do the work to give your code idempotence, you’ll sleep much better. Anytime a job fails in production you can simply run it again; this can even be automated if desired. Idempotence is more work up front, but it can save lots of work, headaches, and customer frustration in the long run.
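Automating the rerun can be as small as a retry wrapper. The sketch below is one possible shape, with made-up names (run_with_retries, flaky_job) and a simulated failure; it is only safe because the job writes by key, so the work redone on the second attempt overwrites rather than duplicates.

```python
import time


def run_with_retries(job, attempts=3, delay_seconds=0.0):
    """Run `job` from the beginning until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay_seconds * attempt)  # simple linear backoff


store = {}          # stands in for the database
calls = {"n": 0}    # counts how many times the job was started


def flaky_job():
    calls["n"] += 1
    store["a"] = 1  # keyed write: repeating it changes nothing
    if calls["n"] == 1:
        raise ConnectionError("simulated outage halfway through")
    store["b"] = 2
    return len(store)


result = run_with_retries(flaky_job)
print(result, store)
```

The first attempt fails after writing one record, the second attempt starts over and rewrites it, and the final state is the same as if the job had succeeded the first time.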