posted: July 23, 2017
tl;dr: Sometimes the field failure really is the customer’s fault...
“Yes, I understand the importance. I’ll send someone on site ASAP,” I replied to my boss, the CEO, as I held the phone handset a foot away from my head to attenuate the voice coming through the earpiece such that it would no longer induce pain.
He was livid, and taking it out on me, the VP Engineering, because the hot new networking product my engineers and I had developed and just released to market was exhibiting strange field failures at an important customer site, the Tennessee Valley Authority (TVA). In addition to generating power, the TVA also had a large broadband network, and was experimenting with our newest product to potentially upgrade the speed and multi-service capabilities (data, video, voice) of their network. This new product was the most sophisticated product our company had ever developed, and we were betting a good portion of the future growth of the company on this product. It was critical that it work, and that it not develop a bad reputation with customers and salespeople.
When the failure reports from TVA first started trickling in, they were atypical. The product was a chassis-based system with a midplane design, circuit board modules that could plug in from the front and rear (mating via pins exposed on connectors), and a lot of custom hardware and especially software running on the circuit boards to do all the networking and data conversion. The failures weren’t the usual types of failures you see in systems of this type: features not working, certain configurations not working, functionality “freezing”, or the entire system resetting. Instead, they were hardware failures: the TVA folks were sending back circuit boards that had burnt-out components. Of the returns we got, it was never the same component on the same type of board twice; each returned board exhibited a different burnt-out section of circuitry. TVA was the only customer that had returned any burnt-out boards.
We had done extensive environmental testing across temperature and voltage ranges before we released the product. We tried more strenuous tests in the lab; we couldn’t get a board to simply burn out. We double-checked our manufacturing processes and pulled other boards for inspection; nothing seemed amiss. Finally, after several more returns, with the CEO getting madder and madder each time as I was unable to solve the problem, I agreed to send our top engineer on site. Maybe, because it was the TVA and the equipment was installed near huge generators at a power plant, there was something strange about the grounding or the electromagnetic fields in the vicinity.
The engineer showed up with a bunch of test equipment and instrumented one of our products installed at the TVA, to continuously monitor electrical and environmental parameters. He was there for several days, and on each of the first two nights, at around 2am, there was a failure with a board getting “fried”. Absolutely nothing seemed amiss in the measurements. Now my reputation and my team’s had gone from bad to worse: not only had we not solved the problem, but our top engineer was on-site and the problem was still occurring right under his nose, albeit while he was sleeping but still monitoring the site with test equipment.
After one more night’s failure, when our engineer got on-site the next morning, he was pulled aside by a TVA engineer on the project and told that the problem had been solved, and that he could pack up his equipment and go home.
The TVA engineer’s story: he had had a suspicion, and to test his theory he had pulled a hair from his head and placed it between the front panel and the frame of our product, such that if the front panel were opened, the hair would fall out. When he arrived at work the next morning, the hair had indeed fallen out. So that last night he set up a video camera aimed at our product. What he saw when he looked at the tape was that at around 2am, a couple of older TVA workers had come over to our product, opened it up, and dragged the blade of a screwdriver across some exposed pins on the midplane, which caused electrical shorts and soon blew up a piece of circuitry on a board.
As explained by the TVA engineer, these two older TVA workers were nearing retirement after working their whole careers at the TVA. They were less than thrilled about the new TVA trial network that was being prototyped with my company’s product. They didn’t want to have to learn a new product, network, and technology, and they perhaps saw a threat to their jobs. So they were sabotaging the trial.