My most challenging field failure ever, part one

posted: December 27, 2017

tl;dr: I wouldn’t believe this story if it hadn’t actually happened to me...

This story took place several years ago, during the first of two consecutive extraordinarily cold polar vortex Midwest winters.

It was a beautiful bluebird morning after around six inches of fresh powder had fallen overnight, enough to make it feel as though I was carving through clouds at my favorite ski resort, Steamboat. I was there for a week over the winter holidays with my family. Everyone else was sleeping in that morning, while I had caught first chair and headed for a remote part of the ski resort to take maximum advantage of the conditions. It was the best morning of that entire trip. After around an hour and a half of blissful powder skiing, I headed into one of the mountain lodges to take my first break of the day. While I was warming up with a coffee, I pulled out my iPhone to see what was happening in the world outside of the powder zone.

Powder mornings are the best mornings

I had six emails that morning from work, which was highly unusual as we were officially shut down for the holiday break. Our customers, who were primarily large Internet Service Providers (ISPs), also had most of their personnel on break, which was why it was usually a safe time for me to escape for a vacation. The emails started out bad and got worse, and each time more people were being added to the cc: list until the entire executive staff of the company was being copied.

One of our customers, a major wireless ISP, was experiencing outages in their network, and they were blaming one of our products. This particular product was installed at a number of their cell sites in the upper Midwest, and the nature of the product was such that it was installed at the top of the cell towers. It was a sophisticated radio frequency (RF) device that did some conditioning of the wireless signals, right next to the actual antennas, to maximize performance. Although primarily an RF device, it did have an embedded microcontroller and a small amount of software, to respond to commands from the base station to turn various features on and off, and to report back some performance statistics. The customer was claiming that our product was causing their wireless base stations to fail, in some cases rebooting and in other cases going offline and staying offline until a field technician showed up onsite to manually bring them back up (which was a huge problem, as these sites were scattered across some very rural territory, and it took hours to drive to some of them). The customer had deployed similar products from two vendors in their network, and they were only experiencing failures at sites where my company’s product was installed; the other vendor’s product worked fine.

In the initial emails, people were speculating that the problem had something to do with the weather. It was very cold that winter, the sites were in an even colder part of the Midwest than Chicagoland, and the product in the field was perched at the top of cell towers, totally exposed to the wind, ice, and snow. We also weren’t hearing reports of problems in other geographies. I was a bit skeptical of this theory. Yes, it was cold, but the product was rated to operate down to an even lower temperature than was being experienced in the field. Also, it had turned really cold at the beginning of December, and the problem didn’t start happening until the holidays.

More importantly, I knew we had actually tested the product at extreme temperatures: we had several large temperature chambers where we tested all our products beyond their ratings. I had seen this testing being done, many times, and I knew we were meticulous about it. But others, who had never seen the testing or the equipment, were doubtful. A few accusations were hurled that we in Product Development / Engineering must have forgotten to do this testing for these products. Another slightly more plausible theory was that something had changed in the manufacturing of the product (perhaps a new source of material, such as a gasket seal) that was more susceptible to temperature problems. So on Day One of the problem, as I madly responded to emails by typing away on my iPhone for nearly two hours from the mid-mountain lodge, we decided to pull some newly manufactured units from production and rerun the temperature tests.

As I expected they would, all tests passed. My great ski morning was but a memory, and now I had a crisis on my hands, as the company’s Senior V.P. of Product. This product hadn’t been designed by me, nor had it been designed by engineers that I directly managed, as it came to my company through an acquisition. But as the ultimate head of product development, it was my responsibility to fix it.

A bad thing happens in a product quality crisis: people outside the product team lose faith in the product and the people who designed it. I suppose this is a natural reaction, but it is surprising how fast it happens and to what degree. Yes, this was a very serious problem for the company: we actually were legally on the hook to replace faulty product in the field by climbing cell towers, and no one wanted to have to hire technicians to do that, if it were even possible to do so in the middle of a polar vortex winter (every year technicians die climbing cell towers; it is a dangerous job even in the best weather conditions). Everybody from the CEO on down was upset and made sure I knew so; all sorts of worst-case scenarios about what could happen to the company were being told to me; and a lot of anger was being aimed at the team that had originally designed the product. It was a classic case of the short rhyme: “When in trouble or in doubt, run in circles, scream and shout.” None of this actually helps. What helps is to remain calm and to approach the situation like any other type of problem that needs to be solved. Until we got to the root cause of the problem, I wasn’t going to leap to conclusions about who should be fired or any other such corrective actions being discussed.

This was a challenging problem to solve because my company did not have access to the other critical piece of equipment that our product interoperated with at the cell site, namely a base station. A base station is the main piece of RF equipment that transmits and receives the radio signals, and performs all the protocol conversions to turn the signals into the packet streams of data and voice. Base stations are expensive, and the companies that produce them are large multibillion-dollar multinational companies who sell their base stations to wireless ISPs, not to small companies building add-on equipment, like the company I worked for.

The temperature testing and some further testing of power and signal variations pretty much eliminated environmental factors. We were able to get the customer to send back some of our product that had not yet been installed, and testing of that product eliminated the possibility of some sort of batch issue causing the field problem. Collectively this testing was eliminating the possibility of a hardware issue causing the problem, which was causing me to shift my attention over to software, even though many of the “run in circles” people were still hung up on how cold it was that winter. I'm sure that some of them thought I was crazy, or trying to deflect blame for the issue.

Our product did have a small amount of software, and it did communicate with the base station. However, the protocol was blindingly simple, a pure master-slave command-response system, where the base station was the master and sent commands, and my company’s product was the slave and sat there listening for commands, sending responses only after being sent a command. Even if the product was sending a malformed response, one would expect a certain amount of robustness in the base station software to be able to handle a bad response. But some of the other clues made it sound like some sort of system-level software problem. When the base stations went down, sometimes they came back online after a reboot cycle, and sometimes they didn’t. Large computer systems can behave like this, and base stations are effectively large multi-board computer systems; they have operating systems and redundancy and subsystems and multiple user as well as programmatic interfaces, and probably millions of lines of code. I suspected that somehow our product was triggering something bad in the base station software, but I couldn’t figure out how.
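To make the idea concrete, here is a minimal Python sketch of that kind of command-response exchange. Everything in it is an assumption made up for illustration: the frame layout (command byte, length byte, payload, checksum), the command numbers, and the names `TowerTopDevice` and `master_exchange` have nothing to do with the real protocol or with either product's actual code. It only shows the two properties described above: the slave stays silent unless it has just received a valid command, and a robust master validates each reply and treats a malformed one as a recoverable error.

```python
from typing import Optional, Tuple

# Hypothetical frame layout (NOT the real protocol):
#   [command byte, payload length byte, payload..., checksum byte]

def checksum(data: bytes) -> int:
    """Toy checksum: sum of all bytes, modulo 256."""
    return sum(data) & 0xFF

def build_frame(cmd: int, payload: bytes = b"") -> bytes:
    body = bytes([cmd, len(payload)]) + payload
    return body + bytes([checksum(body)])

def parse_frame(frame: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (cmd, payload), or None if the frame is malformed."""
    if len(frame) < 3:
        return None
    cmd, length = frame[0], frame[1]
    if len(frame) != length + 3:
        return None  # declared length does not match actual length
    if checksum(frame[:-1]) != frame[-1]:
        return None  # corrupted in transit
    return cmd, frame[2:2 + length]

class TowerTopDevice:
    """Slave side: it only ever answers a command, never speaks first."""
    CMD_SET_FEATURE = 0x01  # hypothetical command numbers
    CMD_GET_STATS = 0x02

    def __init__(self) -> None:
        self.feature_on = False
        self.commands_seen = 0

    def handle(self, frame: bytes) -> Optional[bytes]:
        parsed = parse_frame(frame)
        if parsed is None:
            return None  # stay silent on garbage
        cmd, payload = parsed
        self.commands_seen += 1
        if cmd == self.CMD_SET_FEATURE and len(payload) == 1:
            self.feature_on = payload[0] != 0
            return build_frame(cmd, payload)  # echo back as acknowledgement
        if cmd == self.CMD_GET_STATS and not payload:
            return build_frame(cmd, bytes([self.commands_seen & 0xFF]))
        return None  # unknown command: no response

def master_exchange(device: TowerTopDevice, cmd: int,
                    payload: bytes = b"") -> Optional[bytes]:
    """Master side: send one command and validate the reply defensively."""
    reply = device.handle(build_frame(cmd, payload))
    if reply is None:
        return None  # no answer; the caller can retry or raise an alarm
    parsed = parse_frame(reply)
    if parsed is None or parsed[0] != cmd:
        return None  # malformed or mismatched reply is rejected, not trusted
    return parsed[1]
```

With a master written this way, a bad response from the tower-top unit costs one failed exchange and nothing more, which is why a malformed reply alone seemed like an unlikely explanation for base stations rebooting or going offline.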

We concentrated on the communication protocol, and reviewed every command and response. We found a few small non-idealities in our implementation that could be improved upon (this is almost always the case in software development). So we produced a new software release for our product that would make our protocol implementation even more aligned with the specification (which itself was somewhat subject to interpretation, as this particular rarely-used protocol is very poorly defined).

Now the challenge was to get the new software release installed in the problematic network, which was still exhibiting a handful of intermittent failures every day, several weeks after my fateful ski trip. The product’s software was field-upgradeable, but these products were not on the Internet or any other network. Fortunately, it wasn’t necessary to climb the towers to upgrade the software. The product’s software could be upgraded by visiting each site, momentarily detaching the cable at the base station connected to the product at the tower top, “injecting” new software into the product via the cable and a special device, and then restoring things to their normal state. My company told the customer we would send a team of technicians to drive out to every rural cell site in this network and perform the software upgrade free of charge. The customer agreed, and gave our field techs the access keys and codes needed to get into the cell sites.

This was not fun work for the techs. It was still the middle of a polar vortex winter, and in many cases they were driving hours on snowy back country roads in the upper Midwest, trudging across the frozen tundra to lonely cell sites, trying to get to as many sites as possible each day. Most of the software updates went well, but one in particular didn’t, as I later learned from others (so I can’t vouch 100% for the accuracy of all of this):

It was after dark, and one of the techs was trying to get to one more remote cell site that day. He got to the site, but the access code (or keys) didn’t work and he wasn’t able to get through the gate. Rather than drive hours more the next day to get the right codes/keys and try again, he climbed the fence and jumped down into the cell site. He landed on a pipe that was hidden in the snowy ground, breaking off a valve handle and injuring himself. As it turns out, this valve and pipe were a major control for the natural gas pipeline feeding a nearby town. The break caused the gas pipeline to shut down, which meant that the town had no natural gas to heat homes. The authorities soon noticed and sent police out to the cell site, where they found the tech on the ground, writhing in pain. They hauled him into the police station...

(continued...)