Spreadsheets (arch.)

posted: March 24, 2018

tl;dr: Spreadsheets had their day (actually decades) in the sun, but night is falling...

Spreadsheets were the original “killer app” for the nascent pre-IBM PC personal computer industry. Before Google Sheets there was Microsoft Excel, before Excel there was Lotus 1-2-3, and before Lotus 1-2-3 there was Visicalc. Visicalc was a orders-of-magnitude improvement over the most widely used technology at the time: large hand-calculated ledger sheets, as in sheets of paper. Visicalc was the primary reason that businesspeople started buying Apple II and other early personal computers. I first used Visicalc on a TRS-80 Model II. Back in the late 1970s and early 1980s, spreadsheets were revolutionary.

The all-too-ubiquitous, unchanging spreadsheet user interface

Fast forward forty years and it is becoming clear to some that spreadsheet technology has significant problems and that there are often better solutions available. I was pleased to see a Wall Street Journal article entitled Stop Using Excel, Finance Chiefs Tell Staff. I’ve been anti-spreadsheet for years, for the following reasons:

Spreadsheets are error-prone

At my last company I was famous (or infamous, depending on one’s perspective) for finding errors in spreadsheets (fortunately at my current company we don’t have too many spreadsheets). Someone would develop a spreadsheet and circulate it, and rather than just accept the numbers and conclusions drawn from them, I’d dive deeper. I might see a number in the spreadsheet that looked a little bit off, so I’d dive into the formulas beneath the cells to see how the calculations were being done. I’d often find errors, many of which were the same types of errors made in programming (more about this in a bit): off-by-one errors, failing to include the entire range of cells containing the data for a calculation, constants being mistyped, etc. Based on my own personal experience in the business world, I would estimate that the odds of a spreadsheet of moderate complexity (20 formulae and 200 cells) containing at least one error is above 90%. Blatant errors that make the resulting numbers look way off are usually caught by the spreadsheet author, but subtle errors persist.

These subtle errors can be dangerous if the spreadsheet results are being used to make a major business decision involving large sums of money. As part of the acquisition process investment bankers almost always produce a Discounted Cash Flow analysis in the form of a spreadsheet which shows the expected future cash returns from the investment. Product managers similarly make a business case spreadsheet showing the expected future sales and profits for a new product. If a subtle error is made that affects the growth rate of the business or product, the resulting profit calculation can be way off when that errant growth rate is compounded forward into perpetuity. There are many examples of companies making bad investment decisions involving millions of dollars because of spreadsheet errors; see this article as a starting point.

Of course there are some workarounds that can make spreadsheets less error-prone. See Scientists rename human genes to stop Microsoft Excel from misreading them as dates.

Spreadsheets hide the source code

Few users of spreadsheets realize that what they are actually doing is programming: the formulae that they enter are the source code, and the data that they enter are, well, data. The spreadsheet program itself (e.g. Excel) is actually an interpreter that (assuming auto-recalc is enabled) instantly interprets the current formulae and data to produce the results shown in the spreadsheet cells. If auto-recalc is not enabled, then when the user clicks on “recalc” what he/she is actually doing is running the code to produce the results. The programming language is whatever the syntax is for entering formulae into the cells; this is why some sharp observers have noted that the world’s most popular programming language is not C, Java, or Python, but rather Excel.

The visual display of the spreadsheet, the table of cells, primarily shows just the results of running the spreadsheet interpreter. There’s no easy way to simultaneously see all the source code (the formulae) of the spreadsheet, and hence no easy way for that source code to be reviewed by others, tested, and run against different sets of data, either manually or (ideally) in an automated fashion. So there’s no easy way to apply all the good programming techniques that have been developed over the years to improve software quality.

The spreadsheet user interface (UI) is no doubt responsible for the mass market success of spreadsheets; I know plenty of people personally who are averse to programming but who love to create spreadsheets. The spreadsheet UI lulls spreadsheet users into a false sense of security. Most users don’t actually realize that they are writing code, and hence don’t realize that they need to practice good software hygiene.

Spreadsheet data is almost always out-of-date

Unless the original source of the data is the spreadsheet itself, which is rarely the case, the data in the spreadsheet will be out-of-date as time marches forward and the original source data changes. Spreadsheets are an offline model of what is now a hyper-connected online world with exponential growth in the amount of data being generated and stored in other systems. The collaboration features of Google Sheets don’t solve this problem; they just let multiple users work together on the same offline model of reality. To get results that reflect the current state of the data and reality, one has to pull the data from the original source right before the calculations are run. While there are ways to hook some spreadsheets into certain databases, this is a solved problem in the programming world, as databases have programmatic interfaces.

Better solution exist today

The Wall Street Journal article mentions some other business software solutions. As a programmer, I gravitate towards programming solutions, especially for ongoing data analysis tasks. There have been huge strides made in the past ten years in the Python data analytics & data science community. Python itself is a powerful language that is also easy for beginners to learn. Add in Jupyter notebooks to provide a user-friendly interpreter UI and libraries such as numpy, pandas, and matplotlib for fast data analysis and visualization and the result is a much more capable toolset than any spreadsheet program. When the source code is written in Python, the code can connect in real-time to the databases containing the original source data, the code can be reviewed by others, and it can be tested. In fact it is easy to read and write to Excel files with Python; as with many tasks in the Python world, there are off-the-shelf libraries for that. This can all be done today; there’s no need to wait for Microsoft to add Python to Excel, as is rumored. Excel can just become a historical file format, or the way to pass the results of a Python data analysis program around to other people who need to see the results.

Spreadsheets are often taught in our schools, and there are competitions for spreadsheet jockeys. It’s impressive what some of these contestants can do, but I feel that once they realize that data analytics is best done programmatically, they will leave spreadsheets behind. Perhaps in the future the spreadsheet will take its rightful place in the museum of American business, right next to the adding machine, the typewriter, and the fax machine.

Other views:

3 things you don’t understand about spreadsheets