What can possibly go wrong?: Videoconferencing

posted: October 18, 2020

tl;dr: Others are upset when their videoconference has issues and glitches; I’m surprised it works as well as it does...

The fourth in a series of posts on failures in computer systems and software

Early in the COVID-19 pandemic, my family started doing weekly (now biweekly) videoconferences to keep in touch, since we were not going to be seeing each other in person for quite a while. One of my relatives, who has a PhD (albeit not in Computer Science), expressed his dismay at the quality of the various videoconferencing services we collectively sampled, and wondered why there was not a free, high-quality, open source videoconferencing tool that we could use instead. Maybe he was envisioning some sort of LibreOffice for videoconferencing. I shuddered in amazement at what he was saying.

Videoconferencing has got to be one of the toughest mass-market consumer applications and services to do well. There are some huge inherent challenges in solving this problem, which have stumped large teams of the best-and-brightest developers in the business.

I think back to when I first started using laptop- and mobile device-based videoconferencing regularly, in the fall of 2015 when I joined Uprising Technology, which was housed in the 1871 incubator center in Chicago. We initially used Google Hangouts, Google’s free consumer videoconferencing service, because free was the perfect price for a bootstrapped startup company. The quality was terrible: audio was regularly distorted, video froze, and participants and calls were dropped. Some of this was exacerbated by the spotty Wi-Fi coverage at 1871, which was housed in a very old building with lots of barriers, the Merchandise Mart, and which fed into a Comcast “business” internet access connection. It became so bad that Uprising’s CEO would stay home if she needed to have a better-quality videoconference for an important call with clients or a potential investor. We ended up switching to Zoom’s paid videoconferencing service, and leaving 1871 for WeWork facilities that could provide better Internet service. Our videoconferencing experience improved, but there were still issues.

Videoconferencing is a real-time, low-latency application, unlike many other applications that people use. A real-time application must run exactly as fast as time itself, no faster or slower. It doesn’t do any good to play another participant’s video and audio stream at double speed, although some people listen to podcasts at double speed, and it is clearly bad to slow down a participant’s video and audio stream to half speed. When you are in a videoconference and a participant's video or audio speeds up or slows down, it is because the videoconferencing application is trying to recover from a situation that has caused that participant’s stream to get out-of-sync with the actual progression of time. As with many of the challenges I’m describing here, if you know what the challenges are, you can actually see them happening during a typical videoconference.

How does this even work?

There are real-time operating systems that can prioritize tasks and help ensure that an application is more likely to run in real time, and not gain or lose time like a bad clock. VxWorks is a real-time operating system that I’ve used earlier in my career, and which has been used by NASA on various spacecraft. The computer and device operating systems used by consumers (Windows, macOS, iOS, Android) are not real-time operating systems, but they are multitasking operating systems that allow other tasks and applications to run concurrently, which is not a good thing for the videoconferencing application. By the way, that list of operating systems presents another challenge for a company attempting to develop a videoconferencing service: a native application will need to be written for each operating system plus the Web (to run inside a browser), to maximize the addressable market. Apple, the most valuable company on the planet, has shied away from this challenge: their videoconferencing application, FaceTime, runs only on their own operating systems, macOS and iOS.

Another task running on the computer or device can easily interfere with the performance of the videoconferencing application. This is especially true on older devices with underpowered Central Processing Units (CPUs). Videoconferencing applications are the most CPU-intensive applications that most consumers run, because of all the real-time computations that must be done on the audio and video streams. Trying to run other applications concurrently can overburden the CPU, causing the operating system to have to delay and discard work. Once again, to maximize the addressable market, a videoconferencing company should attempt to support the widest range of consumer devices possible, including as many older devices as possible. This puts a premium on writing highly efficient code and using programming languages like C, which yield highly performant code but at the cost of developer productivity and a risk of security problems. Supporting a wide range of devices also grows the testing and support burden for the company.

Latency measures the end-to-end delay, or the amount of time it takes a piece of information to traverse the entire distance between sender and receiver. If you are speaking in person to someone in the same room, the audio latency is just the distance between the two of you divided by the speed of sound in air, and the visual latency is the distance divided by the speed of light; both are very small, imperceptible to humans.
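
To put rough numbers on the in-person case, here is a minimal Python sketch. Propagation delay is distance divided by speed. The 2-meter distance is my own assumption for illustration, and the speed of sound is the usual approximation for room-temperature air.

```python
SPEED_OF_SOUND_M_PER_S = 343.0            # approximate, dry air near 20 °C
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0    # in vacuum; air is very nearly the same


def propagation_delay_ms(distance_m: float, speed_m_per_s: float) -> float:
    """One-way delay in milliseconds: distance divided by speed."""
    return distance_m / speed_m_per_s * 1000.0


distance_m = 2.0  # assumed distance between two people in the same room
audio_ms = propagation_delay_ms(distance_m, SPEED_OF_SOUND_M_PER_S)
visual_ms = propagation_delay_ms(distance_m, SPEED_OF_LIGHT_M_PER_S)
print(f"audio: {audio_ms:.2f} ms, visual: {visual_ms:.8f} ms")
```

Under these assumptions the audio delay is a few milliseconds and the visual delay is a tiny fraction of a microsecond.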

If you are on a videoconference with that person, however, there is a lot more inherent latency in the system. The audio and visual information are captured by a microphone and camera and digitized. Some buffering takes place at the source, to build up enough data to bother processing. That audio and video data is encoded and compressed, and packetized for transmission. It competes with other packets being sent into the Internet by your device, and eventually is serialized (i.e. turned into a time-based stream of bits) and sent over your device’s Internet access link.
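
One of those delays is easy to estimate: serialization delay is the packet size in bits divided by the link rate. The sketch below assumes a 1,200-byte packet and a 5 Mbit/s uplink. Both numbers are illustrative assumptions on my part, not measurements of any particular service or connection.

```python
def serialization_delay_ms(packet_bytes: int, link_bits_per_s: float) -> float:
    """Time to clock one packet onto a link, in milliseconds."""
    return packet_bytes * 8 / link_bits_per_s * 1000.0


# Assumed values, for illustration only.
PACKET_BYTES = 1200
UPLINK_BITS_PER_S = 5_000_000

delay_ms = serialization_delay_ms(PACKET_BYTES, UPLINK_BITS_PER_S)
print(f"{delay_ms:.2f} ms to put one packet on the wire")
```

That works out to about 2 ms for a single link, and a slower access link makes it proportionally worse.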

The packet then begins its journey through the Internet, where it is de-serialized and re-serialized (more delay) at each network hop. See What can possibly go wrong?: Networking for descriptions of the many things that can go wrong in the Internet. Dropped packets are the biggest issue; each dropped packet means that a small fraction of a second’s worth of audio and video data never makes it to the other participants. Some of the other problems, such as out-of-order packets, can be partially overcome by buffering packets at the receiver, to give some time for temporarily missing packets to appear. This buffering, of course, adds latency. Receiver-side buffering is the primary way that streaming video services like Netflix overcome networking issues; it doesn’t much matter that the video stream you are watching (a movie or even “live” TV) is five to ten seconds behind what is actually being sent at the source. You can often see buffering happening when you stream the same live TV stream on two devices in your home; often they will be noticeably out-of-sync, because each device has chosen to use a different receive buffer size. But for a videoconference hardly any receive-side buffering can be done; if the total end-to-end latency is as high as a quarter of a second, it can cause the participants on a videoconference to talk over each other.
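
Receiver-side buffering can be sketched in a few lines of Python. This is a simplified illustration, not how any particular service does it: it assumes every packet carries a sequence number, and the `ReorderBuffer` name and its `depth` parameter are hypothetical. A deeper buffer tolerates more reordering but holds every packet back longer, which is exactly the latency trade-off described above.

```python
import heapq


class ReorderBuffer:
    """Holds a few packets so that late ones can be slotted back into order."""

    def __init__(self, depth):
        self.depth = depth    # packets held back before anything is released
        self.heap = []        # min-heap of (sequence number, payload)
        self.next_seq = 0     # lowest sequence number not yet released

    def push(self, seq, payload):
        """Add a packet; return the packets now ready to play, in order."""
        if seq < self.next_seq:
            return []         # too late: playback has already moved past it
        heapq.heappush(self.heap, (seq, payload))
        released = []
        # Once the buffer is over its depth, release the oldest packets.
        while len(self.heap) > self.depth:
            s, p = heapq.heappop(self.heap)
            released.append((s, p))
            self.next_seq = s + 1
        return released


def play_order(arrivals, depth):
    """Sequence numbers in the order they would be played."""
    buf = ReorderBuffer(depth)
    played = []
    for seq in arrivals:
        played.extend(s for s, _ in buf.push(seq, b""))
    return played


# Packet 1 arrives after packet 2, but a depth of 2 is enough to fix it.
print(play_order([0, 2, 1, 3, 4], depth=2))
```

With a depth of 0 nothing is held back, so the same arrival order would play packet 2 immediately and discard packet 1 when it finally shows up.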

There are two primary architectures for the videoconferencing service itself. Zoom and other services operate a cloud-based video server: each participant’s audio and video stream is sent to the server, which can do most of the processing to determine the audio and video that each participant’s device should display. The primary alternative, which was pioneered by Skype, is peer-to-peer, in which each participant’s audio and video stream is sent to the other participant(s). It is also possible to have a hybrid, which uses both a centralized server and peer-to-peer. When a server is involved, that server adds latency. Even when a server isn’t involved, there is still some additional latency, as each client device will need to do more processing to determine what audio and video to present to the user.
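
One practical difference between the two architectures is how many copies of its stream each device has to send. Here is a rough Python sketch, assuming a pure peer-to-peer design in which every participant sends directly to every other participant, and a server design in which each participant sends a single stream to the server. The function name and the architecture labels are hypothetical.

```python
def streams_sent_per_participant(participants: int, architecture: str) -> int:
    """How many outgoing copies of its own stream one device must send."""
    if participants < 2:
        return 0                  # nobody to talk to
    if architecture == "server":
        return 1                  # one copy, sent to the central server
    if architecture == "peer-to-peer":
        return participants - 1   # one copy per other participant
    raise ValueError(f"unknown architecture: {architecture}")


for n in (2, 5, 10):
    server = streams_sent_per_participant(n, "server")
    p2p = streams_sent_per_participant(n, "peer-to-peer")
    print(f"{n} participants: server={server}, peer-to-peer={p2p}")
```

The peer-to-peer upload burden grows with the size of the call, which is hard on a home Internet connection with limited upstream bandwidth.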

Processing the incoming audio and video packet streams, to determine exactly what to present to the user, is a very CPU-intensive task. The audio should be equalized, so that each participant sounds about as loud as all the other participants. Echo cancellation should be done, to prevent the audio from being recaptured by the recipient’s microphone and played back to the speaker. Often echo cancellation fails, or there is an audio feedback loop that results in a loud tone. Ideally, when there are more than two participants, the audio should be mixed together, so that multiple participants can speak simultaneously. If the audio isn’t mixed but is instead switched, in which case the service chooses among the active speakers to send just one person’s audio to everyone, the participants on the call can struggle to determine who should be speaking. Switching isn’t so bad for the video streams, since momentarily showing the wrong “speaker” doesn’t bring the conversation to a halt.
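
The difference between mixing and switching can be shown with a toy Python sketch. It assumes each participant's audio arrives as a short frame of 16-bit PCM samples, and it picks the "active speaker" by signal energy. Both are my simplifying assumptions, and real services are far more elaborate.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mix(frames):
    """Sum the frames sample by sample, clipping to the 16-bit range."""
    mixed = [sum(samples) for samples in zip(*frames)]
    return [max(INT16_MIN, min(INT16_MAX, s)) for s in mixed]


def switch(frames):
    """Pass through only the loudest frame, judged by signal energy."""
    return max(frames, key=lambda frame: sum(s * s for s in frame))


alice = [1000, -2000, 1500]   # made-up samples
bob = [500, 500, -500]        # a quieter speaker talking at the same time

print(mix([alice, bob]))      # both voices are present
print(switch([alice, bob]))   # only the louder voice gets through
```

With switching, the quieter speaker simply is not heard by anyone, which is why two people can each think they have the floor.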

Dealing with dropped packets and momentarily degraded (i.e., lower bandwidth) network connections is a major challenge for the videoconferencing software. The software has to decide how to degrade the video and audio quality in the smoothest possible manner so as to provide the best user experience. We’ve all seen the effects of these choices: pixelization, audio that is more distorted than usual, momentary freezes in video and/or audio, or dropped connections.
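
One way to picture that decision is as a ladder of quality levels, where the software steps down to the best level that fits the bandwidth it currently estimates. The Python sketch below is purely illustrative: the levels, the bitrates, and the 80% headroom factor are hypothetical values I made up, not figures from any real service.

```python
# Hypothetical quality ladder: (label, required kbit/s), best first.
QUALITY_LADDER = [
    ("720p video + audio", 1500),
    ("360p video + audio", 600),
    ("180p video + audio", 250),
    ("audio only", 50),
]


def pick_quality(estimated_kbps, headroom=0.8):
    """Return the best level that fits within a fraction of the estimate.

    Returns None when even audio does not fit, i.e. the call would drop.
    """
    budget = estimated_kbps * headroom
    for label, required_kbps in QUALITY_LADDER:
        if required_kbps <= budget:
            return label
    return None


for kbps in (2000, 800, 100, 40):
    print(kbps, "kbit/s ->", pick_quality(kbps))
```

The hard part in practice is estimating the available bandwidth quickly and changing levels without making the result look or sound jarring.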

Videoconferencing software is some of the most sophisticated application software that consumers use. Once you know the challenges this software faces, perhaps you will be impressed, like me, that it works reasonably well, most of the time.