posted: July 3, 2021
tl;dr: Systems can run perfectly for months or years, until an unexpected piece of data shows up...
It happens all the time. It happened again, this week. Code which had been in production for well over a year, and which had successfully processed millions of records, suddenly stopped working and threw a cryptic error message.
Since the code hadn’t been changed in a long time, it wasn’t due to a recent code change. It could have been an environment problem, in which the code is no longer able to be executed properly, perhaps because a disk has filled up or credentials have expired. Often, however, problems like this are due to the appearance of some new input data, which has never been seen before and which the code, when originally written, did not anticipate. It could be a new one-in-a-million (or more) “edge case” or “corner case” which the code authors did not take into account and which the testers never bothered to test.
I call this problem the “packet bomb” problem. Earlier in my career I worked in the computer networking industry, on software systems that had to process packets that were sent across networks. A networking product such as a router can be envisioned as an automated package sorting facility, in which the packets are physical packages, with addresses on them, that are sent over conveyor belts, with the router having to push each packet onto the right sequence of conveyors in order to reach its intended destination. When the packages are nicely formatted, as per the router’s expectations, everything works fine. But every once in a while a package shows up that has a bomb inside it, which explodes as the router is examining it. Chaos breaks out, requiring both repair work and improved robustness, to handle this kind of packet bomb in the future without it exploding.
The packet bomb problem is not unique to the networking industry, and the packages do not have to be networking packets. They can be records being retrieved from a database, or web form submissions, or any other stream of input data. As long as the failure is surfaced by something unexpected in the input data, in a system that had been working for a long time, a packet bomb has exploded.
This week’s packet bomb problem was due to data records being retrieved by an application from a PostgreSQL database via a SQL query. The records contained an ip_address column, and the query tried to filter out records with certain IP addresses so that they wouldn’t be processed, leaving only the records that should. The query was over 70 lines long, and here some of the lines that were causing the issue:
WHERE
NOT CAST(ip_address AS inet) << '162.222.72.0/21'
AND NOT CAST(ip_address AS inet) << '66.85.48.0/21'
The CAST(column AS inet) function will take a text column and treat it as an IP address, so that it can be checked against the IPv4 range to the right of the << operator.
The problem, as we discovered this week, is that if the column value is a text string that cannot be converted into an IP address, the CAST fails and the entire query fails, throwing an error that eventually bubbled up to the application that produced the programmatic query. The records in the database are produced by client software that we did not write or control, running on a wide variety of devices, operating systems, and browsers, with whatever browser plugins users had installed. Until recently, the ip_address values were always nicely formatted strings that looked like IP addresses, or NULL, which did not cause an issue. But recently, a few records showed up that had the value: "unknown". Those are the packet bomb records that caused the failure.
The fix is to add an explicit test to only do the CAST if it is likely to succeed:
WHERE
ip_address NOT LIKE '%.%.%.%' OR ip_address LIKE '%.%.%.%'
AND NOT CAST(ip_address AS inet) << '162.222.72.0/21'
AND NOT CAST(ip_address AS inet) << '66.85.48.0/21'
That solved this particular packet bomb problem. Experience has taught me that there will be others.