Consider the sad path

posted: August 21, 2021

tl;dr: It’s one thing to get a program working initially, and a whole other thing to make it deal with everything that can possibly go wrong...

Let’s define the “happy path” as the sequence of operations that a program performs when it is given expected input and every external dependency works. The “sad path,” then, is any path that the program takes when unexpected input shows up or an external dependency fails.

Programming the happy path is usually pretty easy. A spec often exists for it, which describes what the program should do to achieve the desired outcome. Especially when user stories exist, the order of operations in the typical case will often be documented. This provides a path for the code to follow. The first test cases (i.e. the ones most likely to be written) will test the happy path. The first users will also initially exercise the happy path.

[Image: Typical 403 error page displayed by a browser]

The sad path, by contrast, is far less well defined. So many things can go wrong, among them:

User authentication fails
Authentication with an external API fails
Input arrives with unexpected values, weird formatting, or large size
User performs operations in an unexpected order
User proceeds down one path, backs up with the system in an interim state, then proceeds down a conflicting path
External API or service is temporarily unavailable
External API or service fails when given certain data
External API or service is enforcing rate-limiting, or cannot process large amounts of data
Disk space or other storage space fills up
Program slows down and fails under load

Some of these can reasonably be expected to happen on occasion, in which case code should be written to handle them. For many others, though, it is unclear whether, or how often, they will actually occur in real-world usage. In particular, the failure modes of external APIs and services are often not fully documented by the providers. Sometimes the only way to know what an external API or service does under certain conditions is to experience it. Test code can be written to stress the external service or API, but even this won’t catch everything that can go wrong.
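For the failures that can reasonably be expected, such as a temporarily unavailable or rate-limiting service, one common approach is to retry with a growing delay. The following is a minimal sketch of that idea, not a definitive implementation: `TransientError`, `call_with_retries`, and their parameters are hypothetical names invented for this illustration, and a real client library would raise its own exception types.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for 'service temporarily unavailable' or 'rate limited'."""


def call_with_retries(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    Once max_attempts is exhausted, the last TransientError is re-raised.
    Any other exception propagates immediately, since retrying is unlikely to help.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter is a design choice made here so that tests can record the delays instead of actually waiting. Failures that are not known to be transient are deliberately not retried; those are the ones that still need to surface and be diagnosed.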

If it took 1X time to write the code for the happy path, it can easily take another 1X to 5X to write the code to handle all the sad path cases that users encounter. It’s a lot easier to assume that every API operation succeeds than to deal with every possible problem that can crop up. But you can’t know in advance everything that can go wrong, or how likely a given error actually is. So if you do spend 5X time up front making the code robust against errors, a good portion of that time is likely to be wasted on errors that rarely or never occur.

Given the time-to-market pressures that most software projects face, the bias is to write code for only the most likely known error conditions, and to wait for others to happen in production. This is effectively coding for the happy path and ignoring most of the sad paths. A thorough QA cycle will test for more of the sad path conditions than those envisioned by the developers, but often the test cycle gets squeezed at the end of the project. Even the best QA cycle won’t catch everything that can go wrong in the real world.

Sometimes, due to business realities, you just have to throw the program out into the wild, see how it breaks, and react quickly. If you do this, make sure your program includes the first feature to put into any software product. Hopefully the small amount of pain experienced by the initial users who end up on a sad path is more than offset by the gain achieved by other users who stay on the happy path. The pain can be minimized by responding quickly, with feedback and fixes, to the issues that real-world users encounter. Logging tools and generic exception handlers that notify developers when something goes wrong can shorten the time it takes to diagnose and fix issues.
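As a rough illustration of a generic exception handler that logs and notifies, here is a minimal Python sketch. The names `report_unhandled` and `notify_developers` are hypothetical, and the notification hook is only a placeholder; in practice it might send an email, post to a chat channel, or call an error-tracking service.

```python
import functools
import logging

logger = logging.getLogger("sad_path")


def notify_developers(message):
    """Hypothetical notification hook; this placeholder only writes a log record."""
    logger.error("NOTIFY: %s", message)


def report_unhandled(notify=notify_developers):
    """Decorator factory: log and report any exception, then re-raise it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # logger.exception records the message plus the full traceback.
                logger.exception("Unhandled error in %s", func.__name__)
                notify(f"{func.__name__} failed: {exc!r}")
                raise  # do not hide the failure from the caller
        return wrapper
    return decorator


@report_unhandled()
def handle_request(divisor):
    # Stand-in for real work; a divisor of 0 sends this down a sad path.
    return 10 // divisor
```

The wrapper re-raises after reporting, so it adds visibility without changing behavior: the caller still sees the original error, and the developers get a traceback for a sad path nobody anticipated.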

It’s hard to prevent users from ever ending up on the sad path, but hopefully we can quickly get them back on the happy path.