What can possibly go wrong?: Networking

posted: October 3, 2020

tl;dr: The third in a series of posts on failures in computer systems and software...

Systems can fail or deliver incorrect results due to both hardware and software problems, but an especially common culprit is the network.

Networking issues are even more prevalent these days because computer systems are more networked than they ever were in the past. Gone are the days of all the compute power used by an application residing on premises in a carefully managed data center, with terminals connected to that compute power via dedicated wires.

These days, the end terminal devices are predominantly wireless smartphones, tablets, and laptops using Wi-Fi or broadband cellular (even in many office environments) to give people the freedom to move around and take their computers with them. More importantly, thanks to the cloud computing revolution, many of the applications being used are running on remote servers in one or more data centers often located hundreds or thousands of miles away, all connected via the Internet and other computer networks. The applications themselves often make use of other back-end services located elsewhere, or retrieve data that may come from anywhere on the planet or off.

This highly networked application architecture permits applications to do things that couldn’t be attempted back in the old days. But it also means that applications are more susceptible to failures due to networking issues.

I spent a good fraction of my career in the networking industry, so I’ve experienced firsthand many of the things that can go wrong. Sometimes I shake my head in amazement that the Internet works as well as it does. It was not originally designed to support applications such as real-time streaming video.

It’s accurate to say that the Internet is a “best effort” packet-based network. Various initiatives over the decades to introduce higher levels of “quality of service” to the mass-market Internet have failed. The net neutrality movement, with its prohibition of so-called “fast lanes”, aims to ensure that the Internet forever remains a best effort network. When a packet of information is sent into the Internet, the network will endeavor to make its “best effort” to deliver the packet to its ultimate destination, but there is no guarantee that it will be able to do so. Packets are lost or dropped all the time, for a plethora of reasons.

A full strength Wi-Fi signal does not ensure error-free networking

When a packet traverses any given link in the network, every bit of information in the packet has to be successfully transmitted and received over the physical medium of that link, or else the packet is dropped by the receiver. If that physical medium is a wireless link (Wi-Fi or cellular), it will have a higher bit error rate than a wired link (Ethernet or fiber optics). The error rate is not only a matter of the wireless signal strength. There can be momentary sources of radio interference, such as another nearby transmitter or a microwave oven (the most popular Wi-Fi channels overlap with the radio frequencies emitted by microwave ovens). Radio waves reflect off various surfaces and objects, and as a wireless device moves around, it may momentarily end up in a spot where the reflected waves cancel each other out and there is effectively little or no signal. Most antennas do not emit and receive energy uniformly, so moving the device also affects radio performance. Those old enough to remember orienting a portable radio to pick up a better signal have experienced the same phenomenon that affects Wi-Fi and cellular.
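Because a single corrupted bit fails the packet's checksum, a link's bit error rate compounds over every bit in the packet. A quick back-of-the-envelope calculation (the BER values below are illustrative round numbers, not measurements of any particular link) shows why wireless links drop noticeably more packets than wired ones:

```python
def packet_loss_probability(ber: float, packet_bytes: int) -> float:
    """Probability a packet is dropped on a link with independent bit errors:
    one flipped bit fails the checksum and costs the whole packet."""
    bits = packet_bytes * 8
    return 1.0 - (1.0 - ber) ** bits

# Illustrative bit error rates for a 1500-byte frame (not measurements):
wired = packet_loss_probability(1e-10, 1500)    # roughly 1 packet in a million
wireless = packet_loss_probability(1e-5, 1500)  # roughly 11% of packets lost
```

Note how a BER that sounds tiny (one error per hundred thousand bits) still loses about one packet in nine, because a full-size frame gives 12,000 chances for that one bad bit.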

Even with error-free physical media, momentary network congestion is an unavoidable fact of life in today's best effort Internet. Many consumers think that, when they purchase Internet service at a rated speed, they will (or should) always be able to send/receive that amount of traffic into/from the Internet. In reality that speed rating is a peak instantaneous speed, only achieved when no one else is using any of the shared links that the traffic traverses. Nearly every link in the Internet is shared among users; there are no end-to-end bandwidth guarantees.

The average consumer would probably be shocked by a sight that Internet network engineers see all the time when they measure the instantaneous amount of traffic over given links. Much of the time the link runs at well under half its capacity, but there are occasional spikes up to maximum capacity when a bunch of packets show up at nearly the same time. If a spike persists, the networking equipment sending the packets over the link will have to drop some of them. In fact, dropping packets is a feature (not a bug) of the Internet. Dropped packets are the "slow down" indicator that eventually makes its way back to the source of the traffic.
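The pattern above can be sketched with a toy tail-drop queue: mostly light traffic punctuated by bursts, a link that drains a fixed number of packets per tick, and a finite buffer that overflows during the bursts. The parameters (queue depth, drain rate, burst sizes, probabilities) are invented purely for illustration:

```python
import random

def simulate_link(seed=42, ticks=1000, capacity=10, drain=5, burst_prob=0.1):
    """Toy tail-drop queue: packets arrive each tick, the link drains a fixed
    number per tick, and arrivals that find the buffer full are dropped."""
    rng = random.Random(seed)
    queued = 0
    offered = dropped = 0
    for _ in range(ticks):
        # mostly light traffic, with occasional bursts well above the drain rate
        if rng.random() < burst_prob:
            arriving = rng.randint(8, 20)
        else:
            arriving = rng.randint(0, 2)
        for _ in range(arriving):
            offered += 1
            if queued < capacity:
                queued += 1
            else:
                dropped += 1   # buffer full: tail drop
        queued = max(0, queued - drain)
    return offered, dropped

offered, dropped = simulate_link()
```

With these made-up numbers the average load is about 2.3 packets per tick against a drain rate of 5, so the link sits under half capacity on average, yet the bursts still overflow the 10-packet buffer and force drops — exactly the "feature, not a bug" behavior described above.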

There is also no guarantee that packets sent into the Internet will arrive at the destination in the same sequence in which they were sent. There are almost always many paths through the Internet from the source to the destination. Internet routers can and will choose different paths for different packets based on a variety of factors, including instantaneous network congestion. The end result of all these networking issues is that, if you send 100 packets into the Internet, 98 of them might arrive at the destination, and in a somewhat mixed up order.
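Putting loss and path-dependent latency together, a small simulation (with a made-up loss rate and latency range, and a fixed seed for repeatability) shows how 100 numbered packets can arrive as a shuffled subset — sorted by arrival time rather than by the order they were sent:

```python
import random

def traverse_internet(seqs, seed=7, loss_rate=0.02):
    """Each packet independently risks being dropped and takes a path with its
    own latency; delivery order follows arrival time, not send order."""
    rng = random.Random(seed)
    arrivals = []
    for seq in seqs:
        if rng.random() < loss_rate:
            continue                          # lost somewhere along the way
        latency_ms = rng.uniform(20.0, 80.0)  # per-packet path delay (invented)
        send_time_ms = seq * 0.1              # packets sent 0.1 ms apart
        arrivals.append((send_time_ms + latency_ms, seq))
    arrivals.sort()                           # order by arrival time
    return [seq for _, seq in arrivals]

received = traverse_internet(range(100))
```

Because each packet's path delay dwarfs the spacing between sends, packets sent later routinely overtake packets sent earlier, and the occasional loss leaves gaps in the sequence.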

The Internet layers additional protocols on top of this best effort service, in an attempt to provide an error-free, loss-free communications link. That's what the "TCP" (Transmission Control Protocol) in "TCP/IP" does. TCP attaches sequence numbers to the data it sends so that, when a missing segment fails to arrive within a certain time, the destination can tell the source to resend it. But TCP cannot handle every possible network problem; it has limits. It also introduces latency (end-to-end delay) and jitter (variability in packet arrival timing), which can negatively impact some applications.
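A toy receiver illustrates the core idea of sequence numbering and cumulative acknowledgment. This is a sketch of the concept only — real TCP numbers bytes rather than whole segments and adds far more machinery (windows, retransmission timers, congestion control):

```python
class ReliableReceiver:
    """Sketch of TCP-style reassembly: buffer out-of-order segments, deliver
    them in sequence, and return the next sequence number expected (a
    cumulative acknowledgment) so the sender knows what to retransmit."""

    def __init__(self):
        self.expected = 0    # next in-order sequence number we need
        self.buffer = {}     # out-of-order segments held for later
        self.delivered = []  # data handed to the application, in order

    def receive(self, seq, data):
        if seq >= self.expected:              # ignore stale duplicates
            self.buffer[seq] = data
        while self.expected in self.buffer:   # deliver any contiguous run
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
        return self.expected                  # cumulative ACK

rx = ReliableReceiver()
rx.receive(0, "a")   # ACK 1
rx.receive(1, "b")   # ACK 2
rx.receive(3, "d")   # ACK still 2: segment 2 is missing
rx.receive(2, "c")   # retransmit arrives; ACK jumps to 4, "c" and "d" delivered
```

The unchanging ACK after segment 3 arrives is the signal that tells the sender segment 2 was lost; the buffered segment 3 is then delivered the moment the retransmitted gap is filled, which is also where the extra latency and jitter come from.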

Applications that use the Internet need to be aware of the many things that can go wrong at the networking layer. For a discussion of an application that does so, see What can possibly go wrong?: Videoconferencing.