posted: September 19, 2020
tl;dr: The second in a series of posts on failures in computer systems and software...
As I discuss in “What can possibly go wrong?: Hardware”, hardware glitches and failures can cause systems to fail or to deliver an incorrect result, but the more common culprit is software.
With the exception of simple tasks, it is hard to write a computer program that always produces a correct result under all conditions and given all possible inputs. It’s even hard to know in advance what all those possible conditions and inputs are. Most modern computer applications are fully interactive, with a user (or large numbers of users) performing actions and providing input whenever they choose to do so, which is not always when a program may be expecting to receive it.
The input values provided may not always be within the range that the program is expecting, a problem I wrote about in “Cleansing user input data”. Also described in that post is the situation where the program itself is able to handle the user’s input, but a downstream service that the program relies on (a second company’s system that has a SQL database) is unable to handle that input properly.
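As a minimal sketch of both ideas, the Python below range-checks a value before using it and passes user-supplied text to a SQL database through placeholders rather than by building the SQL string by hand. The `users` table, the `parse_age` and `save_user` helpers, and the 0 to 150 range are all hypothetical, chosen only for illustration.

```python
import sqlite3


def parse_age(raw: str, low: int = 0, high: int = 150) -> int:
    """Convert raw user input to an int and reject out-of-range values."""
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"not a whole number: {raw!r}") from None
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the range {low}..{high}")
    return value


def save_user(conn: sqlite3.Connection, name: str, age: int) -> None:
    """Store a user; the ? placeholders keep the values out of the SQL text."""
    conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", (name, age))


# Hypothetical in-memory database standing in for the downstream service.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")

# A name containing SQL syntax is stored as plain data, not executed.
save_user(conn, "Robert'); DROP TABLE users;--", parse_age(" 42 "))
```

A range check like this only covers what the program itself expects; whether a downstream system can cope with the same value is a separate question that has to be answered for that system.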
There are other types of bugs which manifest themselves when given certain inputs. The divide-by-zero problem, in which the program performs a division operation when the divisor happens to have the value of zero, is a classic one. Since the divisor may itself be based on another calculation, or the result of a long-running tabulation or aggregation, it may not be entirely obvious to the programmer that the divisor could ever have a value of zero.
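To make this concrete, here is a small Python sketch in which the divisor comes from an aggregation, so nothing in the division itself hints that it could be zero. The function names and the "delivered" status string are assumptions made up for the example.

```python
from typing import Optional, Sequence


def delivered_count(statuses: Sequence[str]) -> int:
    """Count the emails that were actually delivered."""
    return sum(1 for status in statuses if status == "delivered")


def click_through_rate(clicks: int, delivered: int) -> Optional[float]:
    """Return clicks per delivered email, or None when nothing was delivered."""
    if delivered == 0:
        # Without this guard, clicks / delivered raises ZeroDivisionError.
        return None
    return clicks / delivered


# Every email in this (hypothetical) batch bounced, so the divisor is zero.
rate = click_through_rate(5, delivered_count(["bounced", "bounced"]))
print(rate)
```

Returning `None` is only one option; raising a descriptive error or reporting "not applicable" may suit a given program better. The point is that the zero case is decided on purpose.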
Handling inconsistencies in state information is another challenge. Let’s say a program is processing the results of an email campaign by examining how users have interacted with the emails. In order for a particular user to have been sent the email, the user had to exist in the database when the email was originally sent. But the user’s interactions arrive later, and when the program processing those interactions attempts to look up the user in the database, the user may have been deleted, for whatever reason. It’s common for programmers, when handling complex state information, to put checks in their code to test for unexpected inconsistencies, and to log an error message that says something like “this situation should never happen”. Occasionally, in the real world, those messages do show up in the logs.
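The email campaign example might look something like the following Python sketch. The data shapes are assumptions: interactions are `(user_id, action)` pairs, and a plain dictionary stands in for the user database.

```python
import logging
from typing import Dict, Iterable, Tuple

logger = logging.getLogger("campaign")


def tally_opens(
    interactions: Iterable[Tuple[int, str]],
    users: Dict[int, str],
) -> Dict[str, int]:
    """Count "open" interactions per user name, skipping unknown users."""
    opens: Dict[str, int] = {}
    for user_id, action in interactions:
        name = users.get(user_id)
        if name is None:
            # The "should never happen" case: the user was deleted after
            # the email was sent but before the interaction arrived.
            logger.warning("interaction for unknown user %s; skipping", user_id)
            continue
        if action == "open":
            opens[name] = opens.get(name, 0) + 1
    return opens


# User 2 no longer exists, so that interaction is logged and skipped.
result = tally_opens([(1, "open"), (2, "open"), (1, "open")], {1: "ana"})
print(result)
```

Logging and skipping keeps the rest of the batch moving, at the cost of silently dropping one record from the totals; whether that trade-off is acceptable depends on the application.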
An even bigger cause of problems is the fact that nearly every modern computer, whether a developer’s laptop or a server in a cloud service provider’s data center, is a complex software system running layers of software written by many different people and groups of people, with no one entity in charge of the quality of all the software. The actual application itself may constitute just a small fraction of the total number of instructions executed by the computer’s CPU. A problem at any layer of the software may cause the application to fail or produce an incorrect result.
The run-time environment handles some important chores for the application, including high-level memory management. Unbeknownst to the application, the run-time environment’s garbage collector may decide every now and then that it needs to gather up memory previously used by the application, thereby introducing variability into the execution speed of the application. This variability in timing may cause a user-visible problem in the application.
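One way to see these pauses, at least in CPython, is to register a callback in the standard `gc.callbacks` list, which the collector invokes at the start and end of each collection. The sketch below is specific to Python's cycle collector and says nothing about other run-time environments; the `record_gc_pause` function and `pause_durations` list are hypothetical names.

```python
import gc
import time

pause_durations = []        # seconds spent in each observed collection
_collection_started = {}    # generation -> timestamp of the "start" phase


def record_gc_pause(phase, info):
    """Callback invoked by the collector before and after each collection."""
    generation = info["generation"]
    if phase == "start":
        _collection_started[generation] = time.perf_counter()
    elif phase == "stop":
        started = _collection_started.pop(generation, None)
        if started is not None:
            pause_durations.append(time.perf_counter() - started)


gc.callbacks.append(record_gc_pause)
try:
    # Create garbage that only the cycle collector can reclaim.
    for _ in range(10_000):
        node = []
        node.append(node)   # a list that refers to itself
    gc.collect()            # force at least one full collection
finally:
    gc.callbacks.remove(record_gc_pause)

print(f"{len(pause_durations)} collections, longest {max(pause_durations):.6f}s")
```

The durations differ from run to run and from machine to machine, which is exactly the variability the application never asked for.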
The run-time environment runs on top of the computer’s operating system, often Linux, Windows, or macOS. Modern operating systems comprise hundreds of megabytes or even gigabytes of software, written by huge teams of programmers over decades. The operating system itself relies upon other close-to-the-hardware software written by chip vendors or the computer hardware vendor. As a result, a single operation in the application can easily invoke a dozen or more different layers of software, all written by different programmers or teams of programmers.
Modern multitasking operating systems are complex beasts. They aren’t just running the application itself; they are also running other tasks in a concurrent fashion. Occasionally problems with these other tasks can negatively impact the application. The operating system also provides and manages the finite resources of the computer system itself, including memory, threads, processes, networking sockets, network interfaces, database connections, and others. Should any of these finite resources be temporarily exhausted, the application may fail.
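When exhaustion is temporary, one common mitigation is to retry the failed operation after a short pause. The Python sketch below assumes the failure surfaces as an `OSError` (as running out of file descriptors does, for example); the `with_retries` helper, its defaults, and the `flaky_open` stand-in are hypothetical.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=0.1):
    """Call operation(), retrying on OSError with a growing pause in between."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise                      # out of attempts: let the caller see it
            time.sleep(delay_seconds * attempt)


# A stand-in operation that fails twice, as if file descriptors had run out,
# and then succeeds once the resource is available again.
calls = []


def flaky_open():
    calls.append(1)
    if len(calls) < 3:
        raise OSError(24, "Too many open files")
    return "ok"


outcome = with_retries(flaky_open, delay_seconds=0.01)
print(outcome, len(calls))
```

Retrying helps only when the shortage really is temporary. If the application itself is leaking the resource, retries merely delay the failure.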
In the cloud computing era, most applications rely upon other services running on other computers, connected via a network. Networking problems are another source of errors, which I discuss in “What can possibly go wrong?: Networking”.