My most challenging field failure ever, part two

posted: January 6, 2018

tl;dr: I wouldn’t believe this story if it hadn’t actually happened to me...

(continued from part one):

...where, fortunately, he was able to get the police to believe his story after they checked a few facts that he gave them.

Initially we thought that the software upgrade might have solved the problem, but we soon saw the problem happen at some sites that had received the upgraded software. We kept trying other ways to troubleshoot this systems issue. We came up with an in-line monitoring tool (basically, a packet sniffer, which is an incredibly useful tool when debugging networking issues) and wrote some code to create a logger. We installed this at a small number of sites; of course (as per Murphy’s Law) the sites with the long-term monitoring in place never went down.

We (the engineers working the issue with me) were pretty convinced that if we could just see the issue happen once with the monitoring tool in place, which captured all the communication between our product and the base station, we would know what the trigger for the problem was. Our customer, the wireless ISP, was starting to share more information with us. We got access to some log files from some base stations that had crashed: the logs didn’t contain evidence of the exact trigger, but they definitely captured the base station going through a prolonged operating system shutdown sequence in response to some trigger. Something smelled funny; I still couldn’t imagine any sort of packet that we could send to the base station which would cause it to initiate an entire operating system shutdown. But I’ve seen some weird things in my career; anything is possible.

The wireless ISP had some base stations in one of their labs, and they let us make a couple of trips out there to try to recreate the problem in the lab, with our monitoring tool in place. We were unable to recreate it. One of the trips was a solo trip by me, and I remember a harrowing drive in the dark on a nearly deserted road to get to the Detroit airport for my return flight home: it was so cold that the rental car’s windshield wipers and washers were frozen, and I had to pull into an empty gas station not for gas but to find some way to clean the windshield so that I could see.

Because the customer had log files of the base station crashing, they were starting to lean on the base station vendor, a well-known multinational firm headquartered in Europe but with facilities around the world. Of course (Murphy’s Law again) the lab where they did their base station testing was up in Canada, where it was even colder. After some prodding from the customer, we were allowed to go into the base station vendor’s lab to do some joint testing, to attempt to recreate the problem. Fortunately the base station vendor’s lab personnel were very friendly towards us, even though this whole situation was a pain to them and, from their perspective, there was already a solution in place: just tell the wireless ISP not to use my company’s product and to use the other vendor’s product which worked. Finally, in the lab in Canada, after many attempts, we were able to recreate the problem one time, with monitoring in place and logging on both ends.

The logs appeared to clear my company’s product of any wrongdoing. We saw the base station send a command to the product and the product produced a properly-formatted response, but then soon after that the base station went through its shutdown sequence. The base station vendor’s lab personnel summoned the base station software team, who started investigating the base station software in earnest. We left them some equipment so they could continue to test on their own. The senior software engineer and I, who were working the issue most intensely, returned home in triumph. I felt like Bill Murray’s character Venkman in Ghostbusters, with the software engineer playing the role of Dan Aykroyd’s character Ray after they had captured a nasty bug and proclaimed:

We came, we saw, we kicked its ass!

A few weeks later the base station vendor produced a new software release which solved the problem in the lab. Eventually the new software release was rolled out to the wireless ISP’s network, and the field issues went away. With the root cause now known, we could piece together the entire chain of events that led to the field issue. It still boggles my mind to walk through this series of unfortunate events, but here goes:

The wireless ISP customer requested a new feature in the product: they wanted the ability to adjust a parameter in the product in the field that up until then had been a fixed value. This was a new hardware feature but it also meant implementing a new command from the base station, to allow the base station to set this parameter. The customer challenged both of their product vendors (my company and our competitor) to deliver this feature. My company delivered it first, so we scored one small point for being on the leading edge, and probably 10,000 negative points for the issue that erupted because of it: the reason the other vendor’s product didn’t have the field issue (yet) was that they hadn’t implemented the feature (yet). The feature required a small software enhancement in the base station, to detect devices that had this new feature (based on vendor version number) and then send a command to set the parameter. The base station software release with the enhancement had started to be rolled out towards the end of December, which accounted for the timing of the field issue.

Now here’s the silliest thing: the product shipped with a default value for this parameter which was the same as the previously hard-coded value. If the base station had never issued the new command to set this value, it would have continued to be the default value. But the base station did issue a command telling the product to set the parameter to...the default value that was already in use! The customer was not actually using the new feature yet!

This command didn’t cause any problem at all with the product. But the implementation of that command in the base station was faulty. We were told that it had something to do with a rounding error (or perhaps a type conversion between integer and floating-point values) that sometimes produced an unhandled divide-by-zero exception, which eventually led to an operating system shutdown sequence. The base station vendor had considered this to be a minor feature, so they apparently didn’t bother any of their high-cost development teams in Europe or North America with implementing it. In the spirit of global development they assigned the feature to one of their low-cost teams in China to implement. I have no direct evidence but I envision it having been coded by a new junior developer without a lot of formal training who was assigned one tiny feature within a massive monolithic software codebase. They probably didn’t come up with any good test cases for this feature; they certainly couldn’t do true end-to-end testing until our product existed, and they didn’t have one until we delivered several to their Canadian lab well after the issue erupted in the field.
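For illustration only, here is a minimal Python sketch of how a conversion from floating-point to integer can quietly produce a zero divisor. It is not the base station's actual code, which was never shown to us; the function names, the 1000.0 constant, and the input values are all hypothetical, chosen just to show the general class of bug.

```python
from typing import Optional


def scaled_value_unsafe(requested: float) -> float:
    """Hypothetical handler that truncates a float and then divides by it."""
    divisor = int(requested)  # int() truncates toward zero: 0.9999999999999998 becomes 0
    return 1000.0 / divisor   # raises ZeroDivisionError whenever divisor is 0


def scaled_value_checked(requested: float) -> Optional[float]:
    """Same hypothetical calculation, but rounding and guarding the divisor."""
    divisor = round(requested)  # round to the nearest integer instead of truncating
    if divisor == 0:
        return None             # reject the input rather than let an exception escape
    return 1000.0 / divisor


if __name__ == "__main__":
    almost_one = 0.9999999999999998  # a value just below 1.0, as float arithmetic can produce
    print(scaled_value_checked(almost_one))
    try:
        scaled_value_unsafe(almost_one)
    except ZeroDivisionError as exc:
        print("unhandled in the unsafe version:", exc)
```

A value that is a hair below 1.0 looks harmless in a log file, which may be one reason a bug like this shows up only sometimes: most inputs truncate to a nonzero integer and everything works.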

So that is how a developer in China could make a small coding error which led to a natural gas pipeline getting shut down months later halfway around the world during one of the coldest winters on record, as well as a myriad of other problems. It also explains how an obvious weather-induced hardware failure in one product could morph into a software issue in a separate product. This was the most interesting “series of unfortunate events” in my career (so far), but you see situations like this happen all the time in the industry, especially with bugs in minimally-resourced open source packages that are relied upon by many (e.g. Heartbleed).

Lessons learned/reinforced:

As always, follow the advice of The Hitchhiker’s Guide to the Galaxy: Don’t Panic
Many non-technical people haven’t read The Hitchhiker’s Guide to the Galaxy and hence they are predisposed to panic 😉
Be careful of cramming in too many features; there’s always a risk and the possibility of unintended consequences
Don’t leap to conclusions, don’t assign blame
Look at the problem holistically
End-to-end interoperability testing is important
It doesn’t matter who created the problem; what matters is fixing it

and finally, the most important:

Never check email on a powder day 😉