posted: September 13, 2020
tl;dr: The first in a series of posts on failures in computer systems and software...
Early in my career, back when I was a young, hotshot developer, I joined a company whose lead Quality Assurance (QA) engineer was a good 25 to 30 years my senior. He had a somewhat fatalistic view of the systems he was assigned to test, a viewpoint that had been built up over decades of industry experience. I eventually took the first system I developed for the company into his lab for him to test, and was probably acting a bit conceited about its quality. He quickly put me in my place by telling me: “It’s not if it will break, it’s when it will break.”
Now I am that grizzled veteran with decades of experience, and I hold a similar attitude. One of the things that you gain with experience is first-hand knowledge of the many things that can go wrong when building even a slightly complex system. This is one of the reasons why I am a fan of elegant simplicity in design. It’s also why I thought I should enumerate some of the things that can possibly go wrong in various systems, starting with hardware.
People with only a cursory exposure to computers often expect them to be perfect: given the same input, they should always produce exactly the same output. Systems that behave this way are called deterministic. Very simple computer programs, or small pieces of large programs, can be deterministic. Most of the assignments in lower-level computer science classes (e.g. “calculate the prime numbers between 1 and 100”) are to write deterministic programs, so even people who have had some exposure to programming may think that this is the way software should always behave. Alas, there are many things that can and do go wrong.
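To make that concrete, here is a minimal sketch in Python (my choice of language for the examples in this post) of the kind of deterministic program such a class might assign. Run it as many times as you like; the same input always yields the same output.

```python
def primes_up_to(n):
    """Return the prime numbers from 2 to n using simple trial division."""
    primes = []
    for candidate in range(2, n + 1):
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
    return primes

if __name__ == "__main__":
    # Deterministic: every run with the same input prints the same list.
    print(primes_up_to(100))
```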
Software can be run in the cloud, but ultimately it has to run on real hardware, i.e. some computer somewhere, whether that be a programmer’s own laptop or a server in an AWS data center. All computers have finite resources. There is an upper bound on every kind of memory: the various levels of cache, general-purpose RAM, nonvolatile flash storage. Processors have a finite number of cores and run at finite clock speeds. Disk storage has finite capacity. Network interfaces can only transfer data up to a maximum rate. All of these finite capacities can affect overall performance, and can cause errors when software hits the physical limits imposed by the hardware.
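None of these limits is hypothetical; most can be queried directly. The short sketch below, using only Python’s standard library, prints a few of them for whatever machine it runs on (the memory query is POSIX-only, and the exact numbers will obviously vary).

```python
import os
import shutil

# Every machine has hard limits; software that ignores them eventually hits
# out-of-memory errors, full disks, or saturated CPUs and networks.
print("CPU cores:", os.cpu_count())

disk = shutil.disk_usage("/")
print("Disk total (GB):", disk.total // 2**30)
print("Disk free  (GB):", disk.free // 2**30)

# Physical RAM (POSIX systems only; skipped elsewhere).
if hasattr(os, "sysconf") and "SC_PHYS_PAGES" in os.sysconf_names:
    ram_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    print("RAM total (GB):", ram_bytes // 2**30)
```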
Occasionally hardware fails, and can cause even a deterministic program (such as the 1-to-100 prime number generator) to fail. Power supplies, because of the heat they dissipate and the components they are constructed from, are the most failure-prone elements of computer systems. Sometimes they fail catastrophically, killing the computer outright. Or they can become “noisy”, passing higher-frequency electrical signals into the computer circuitry, which can cause that circuitry to exhibit brief, momentary failures, such as flipping the state of a bit or computing an incorrect result for an instruction. Such a momentary “glitch” might allow the program to run to completion but produce an incorrect result. Another possible source of hardware-induced failures is external stimuli such as cosmic rays.
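To see how small such a glitch can be, here is a toy illustration: the bit flip is simulated in software rather than caused by a flaky power supply or a cosmic ray, but the end result is the same kind of plausible-looking wrong answer.

```python
def flip_bit(value, bit):
    """Simulate a hardware glitch by flipping a single bit of an integer."""
    return value ^ (1 << bit)

correct = sum(range(1, 101))      # the intended result: 5050
glitched = flip_bit(correct, 4)   # one bit flipped in memory: 5034
print(correct, glitched)
```

On a healthy machine the flip never happens; the point is how little it takes to turn 5050 into 5034 with no error message of any kind.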
Hardware failures are very rare; it might take years before a person’s laptop exhibits one. But given a large enough number of computers, such as the tens or hundreds of thousands of servers in an AWS data center, hardware failures will happen every day. The software that runs these data centers can detect gross failures, take servers out of service, and redistribute jobs onto good servers. However, this monitoring software can’t instantly detect every minor failure, so output errors can and do happen.
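I have no insight into how any particular cloud provider actually implements this, but the general shape of the logic is roughly what you would sketch on a whiteboard: check each server’s health, pull the unhealthy ones out of rotation, and hand their work to the survivors.

```python
import random

# A toy model of fleet monitoring: nothing here reflects any real provider's
# implementation, just the general shape of "detect, remove, redistribute".
servers = {f"server-{i}": [] for i in range(5)}   # server name -> assigned jobs
for i in range(20):                               # round-robin job assignment
    servers[f"server-{i % 5}"].append(f"job-{i}")

def is_healthy(name):
    """Stand-in health check; a real one would probe the actual hardware."""
    return random.random() > 0.1                  # ~10% chance of 'failure'

healthy = [name for name in servers if is_healthy(name)]
failed = [name for name in servers if name not in healthy]

# Take failed servers out of service and move their jobs onto healthy ones.
orphaned = [job for name in failed for job in servers.pop(name)]
if healthy:
    for i, job in enumerate(orphaned):
        servers[healthy[i % len(healthy)]].append(job)

print({name: len(jobs) for name, jobs in servers.items()})
```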
The more servers a program requires, the more likely this problem becomes. Large simulations that take hours or days and require hundreds or thousands of servers can produce different results across different runs. This is why the specialized supercomputers often used for large simulations build in hardware redundancy, such as running each instruction on two or more independent processors and checking that each produces the same result. Simulations run on conventional commercial-grade cloud data centers may have to be run repeatedly, to make sure the results are consistent from run to run.
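The same idea can be sketched in a few lines of code: run the computation more than once and only trust an answer that agrees with itself. This is just an illustration of the principle, not how any real supercomputer implements lockstep execution.

```python
from collections import Counter

def simulate(x):
    """Stand-in for an expensive simulation step; hardware faults could make
    two runs disagree even though the code itself is deterministic."""
    return sum(i * i for i in range(x))

def run_redundantly(compute, arg, copies=3):
    """Run the same computation several times and keep only a majority answer.
    On real redundant hardware the copies would run on independent processors."""
    results = [compute(arg) for _ in range(copies)]
    answer, votes = Counter(results).most_common(1)[0]
    if votes <= copies // 2:
        raise RuntimeError("runs disagree; rerun or investigate the hardware")
    return answer

print(run_redundantly(simulate, 1_000))
```

Majority voting in software is the cheap, slow cousin of running each instruction on redundant processors; either way, the goal is to catch the rare run that silently went wrong.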
For most commercial-grade software applications, hardware reliability issues are not a major concern. A more likely source of problems is the software itself and the other software running on the computer or server, which I discuss in What can possibly go wrong?: Software.