posted: October 11, 2020
tl;dr: 7 tips and tricks for solving intermittent software problems...
One way in which software engineering is like other branches of engineering, including automotive engineering, is that it is easier to fix problems which happen every time the triggering conditions occur. Intermittent problems are much trickier to find and fix. If your car makes a strange noise on one out of every hundred left turns, there’s a good chance that your mechanic won’t be able to fix it, because the problem won’t happen while the car is in for repair. You may have to wait until the problem starts happening more frequently. The same applies to software.
When testing, releasing, and operating a complex software system, you typically first have to solve all of the problems that happen every time, i.e. the one-in-one or 1:1 problems. Then you move on to the one-in-ten (1:10) problems, followed by the 1:100, the 1:1000, etc. At some point the system is working well enough, and the problems are infrequent enough, that it makes business sense, for the software vendor and the customers and users, for the system to be put into production.
The problem solving should not stop when the system is in production, however. As more customers use the system, more esoteric and infrequent issues will almost certainly crop up, because the product will experience a larger number of use cases than it ever did while it was being tested. The one-in-one-million (1:1,000,000) problems will start to happen, if the product is successful enough.
I’ve written about some of the things that can go wrong in complex software systems, in hardware, software, and networking. Many of those problems are intermittent. Here are my tips for finding and fixing intermittent problems:
Find a way to reproduce the problem every time
This is a pretty obvious one: if you can find a way to change the problem from a 1:N problem to a 1:1 problem, by coming up with some set of conditions and steps to cause the problem every time, it should be much easier to solve. The trick is finding a way to do so, if that is even possible. You can closely examine the state of the system and any input provided right before the failure occurs, perhaps by adding additional logging to the system and waiting for the problem to recur. You can try catching the error and dumping the state of the system right after the error occurs. Sometimes there is some characteristic of the system state, or the input, or the timing, or the ordering of events, which is directly causing the problem.
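The idea of catching the error and dumping the system state can be sketched roughly as follows. This is only an illustration: `run_with_state_dump`, `operation`, `state`, and `dump_path` are hypothetical names, and the sketch assumes the relevant state fits in a dictionary that can be written out as JSON.

```python
import json
import time
import traceback


def run_with_state_dump(operation, state, dump_path):
    """Run operation(state); on failure, save the state and traceback, then re-raise."""
    try:
        return operation(state)
    except Exception as exc:
        snapshot = {
            "timestamp": time.time(),
            "error": repr(exc),
            "traceback": traceback.format_exc(),
            "state": state,
        }
        # default=repr keeps the dump from failing on values JSON cannot encode.
        with open(dump_path, "w", encoding="utf-8") as fh:
            json.dump(snapshot, fh, indent=2, default=repr)
        raise  # the caller still sees the original error
```

Re-raising after the dump means the wrapper changes nothing about how the failure propagates; it only leaves a record behind that can be compared across several occurrences of the problem.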
However, it’s not always possible to find a way to make the problem happen every time. Maybe an external API which writes to a database fails approximately 1 in every 10,000 calls regardless of what input you give it to write, for reasons completely unrelated to the way you are using the API. Then you must try other approaches.
Gather and examine every available clue
You may already be doing this as you attempt to reproduce the problem every time, but if not, you should be. Try to determine where in the code the problem is occurring, and dump as much state information as possible both before and after the problem occurs. Don’t be afraid to temporarily increase the amount of information logged as you attempt to gather more data about the conditions under which the problem occurs; the additional logging can always be removed later. Look outside the code at the computer system as a whole to assess what else may be going on when the problem occurs; perhaps there is something else in the system which is interfering with normal execution of the code. By perusing these clues, you will sometimes find the exact conditions which cause the problem every time; the problem is intermittent only because those conditions don’t occur very often.
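One way to temporarily increase logging around a suspect section is a small context manager built on Python's standard `logging` module. This is a sketch under assumptions: `verbose_logging` is a hypothetical helper, and it assumes the code in question logs through a named logger.

```python
import logging
from contextlib import contextmanager


@contextmanager
def verbose_logging(logger_name, level=logging.DEBUG):
    """Temporarily lower a logger's threshold, restoring it on exit."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Restore the old level even if the wrapped code raises.
        logger.setLevel(previous)
```

Wrapping only the suspect section, for example `with verbose_logging("payments"):` where `"payments"` is a made-up logger name, keeps the extra output limited to the code path under investigation and makes it easy to remove later.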
Write a stress test
Usually the intermittent failure can be isolated to a particular section of the code. To reproduce the problem more often so that it can be diagnosed and solved, a stress test can be written which repeatedly exercises that section of code. The goal is to take a problem that might happen once a week in production and try to make it happen, in a test environment, once every few minutes. This allows data to be gathered about the problem more quickly. It also allows the code to be tweaked to see if the problem can be solved. If the problem actually is solved, the stress test can function as a regression test, to ensure that the problem never creeps back into the code.
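A stress test along these lines can be quite small. The sketch below is illustrative only: `stress` is a hypothetical harness, and `flaky_parse` is a made-up stand-in for the section of code under suspicion, which here fails on one rare input value.

```python
import random


def stress(operation, iterations, seed=0):
    """Call operation(rng) repeatedly; return (iteration, exception) for each failure."""
    rng = random.Random(seed)  # fixed seed, so a failing run can be replayed exactly
    failures = []
    for i in range(iterations):
        try:
            operation(rng)
        except Exception as exc:
            failures.append((i, exc))
    return failures


def flaky_parse(rng):
    """Hypothetical code under test: fails on roughly 1 in 1000 inputs."""
    value = rng.randrange(1000)
    if value == 777:
        raise ValueError(f"cannot handle {value}")
    return value


if __name__ == "__main__":
    failures = stress(flaky_parse, iterations=100_000)
    print(f"{len(failures)} failures in 100000 iterations")
```

Seeding the random number generator is a design choice rather than a requirement: it makes the run repeatable, so the same failing iterations show up again after adding logging, and it lets the test be kept as a regression test once the problem is fixed.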
Detect when the failure happens and retry
Sometimes it’s possible for the code itself to immediately recognize when the intermittent problem has occurred. This is common when calling external APIs which occasionally fail; either the API returns an error code and message, or a timeout occurs. Sometimes the intermittent failure can be overcome by having the code retry the operation which failed. Often retries are done with an exponential backoff delay, i.e. waiting an amount of time that increases exponentially with each attempt, to give the external system more time to recover so that the next operation succeeds.
This approach may not work due to the nature of the operation which fails. If it is a read-only operation, then it is usually worth a try. But if the operation writes data, or changes state, then it is much harder to retry, because you first have to determine where in the write process or state updating process the failure occurred; a partial write or update may have been done. It may not always be possible to retry an intermittent failure.
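Retrying with an exponential backoff delay can be sketched as below. The names `retry_with_backoff`, `operation`, and the default values are assumptions chosen for illustration, and the sketch is only appropriate for operations that are safe to repeat, such as read-only calls.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Only the exception types listed in `retry_on` trigger a retry; anything else propagates immediately, so genuine bugs are not hidden behind repeated attempts. Passing `sleep` as a parameter is just a convenience that lets a test run without actually waiting.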
Detect the failure later and repair the state
Sometimes it is not possible to detect and fix the problem in real time, but it can be done later, if the input data is still available. This is actually a somewhat common approach when dealing with databases: you write a “database consistency” job which scans the database and fixes missing data or inconsistent state information, and then run this job every hour, day, or week. Some may consider this to be giving up, but it may be the most cost-effective way of solving the problem.
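As a rough illustration of such a consistency job, the sketch below uses Python's built-in `sqlite3` module and an entirely hypothetical schema: an `orders` table whose `total` column is supposed to equal the sum of the matching `order_items.price` values (stored as integers). The item rows play the role of the input data that is still available.

```python
import sqlite3


def repair_order_totals(conn):
    """Recompute orders.total wherever it disagrees with order_items; return rows fixed."""
    cur = conn.execute(
        """
        UPDATE orders
        SET total = (SELECT COALESCE(SUM(price), 0)
                     FROM order_items WHERE order_id = orders.id)
        WHERE total IS NOT (SELECT COALESCE(SUM(price), 0)
                            FROM order_items WHERE order_id = orders.id)
        """
    )
    conn.commit()
    return cur.rowcount  # number of rows that were inconsistent and got repaired
```

The `IS NOT` comparison is used so that a missing (`NULL`) total also counts as inconsistent. Returning the number of repaired rows gives the scheduled job something to log, which helps show how often the underlying intermittent problem is really happening.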
Try a completely different approach
As noted above, the intermittent problem can usually be isolated to a particular section of the code. One possible solution is to rewrite this section of code by taking a completely different approach. Maybe the ordering of events can be changed, or the code can be rewritten in a pure functional style, or a different methodology can be implemented for code which runs concurrently. If there is a reliance on an external library, and a suspicion that the problem may reside in that library, replacing the library may actually be the true solution to the problem. This tactic may take a while to implement, and may not solve the problem, so these downsides need to be considered before embarking on this path.
Live with it
Not every problem makes economic sense to resolve. If the impact of the problem is minor, and it cannot be found and fixed after spending some time trying the various approaches described above, it might be best just to live with the deficiency. There may be other bugs or new features which users would prefer to see addressed with the time that would otherwise be spent solving the intermittent problem.