posted: August 14, 2021
tl;dr: If all the data fits in memory, then use memory...
The computer industry has come a long way from the early 1980s, when 64KB of memory in a personal computer was enough of a selling point to be used in the product name, as was the case with the first computer I owned, a Commodore 64. The MacBook Pro I am using to compose this post has 16GB of RAM, an increase of more than 250,000 times. 16GB holds a lot of data. If that data consists of 1KB records, each describing a person, this computer can hold data for more than 10 million people in memory, provided I close all my browser tabs, windows, and other applications.
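For the curious, here is the back-of-the-envelope arithmetic behind those figures (the 1KB record size is just the illustrative value above):

```python
GiB = 2**30
KiB = 2**10

c64_memory = 64 * KiB        # Commodore 64: 64KB
laptop_memory = 16 * GiB     # a current MacBook Pro: 16GB

# Growth factor from 64KB to 16GB of RAM.
print(laptop_memory // c64_memory)    # 262144, i.e. more than 250,000x

# How many 1KB records fit in 16GB.
record_size = 1 * KiB
print(laptop_memory // record_size)   # 16777216, comfortably over 10 million
```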
Sometimes we, as programmers, forget how much memory is available on the computer that will be running a data processing job. We write the job so that it streams some records in, processes them, and streams the results out. Or we might write it as an event-based job, where each record becomes an event that gets processed individually. This can include fanning out events to multiple processes running on multiple cores or processors, to use parallelism to process the data faster.
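To make the streaming shape concrete, here is a minimal sketch in Python; the file names and the trivial uppercase transform are hypothetical, and the point is only that a single record is held in memory at a time:

```python
def stream_records(in_path, out_path):
    """Stream records through one at a time: memory use stays constant
    regardless of how large the input file is."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:                  # records stream in one at a time
            record = line.rstrip("\n")
            result = record.upper()       # stand-in for the real per-record work
            dst.write(result + "\n")      # results stream out immediately

stream_records("people.txt", "people_out.txt")
```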
Streaming systems and event-based systems are great for certain types of tasks. They are good choices when:

- the data is too large to fit into the memory of a single machine;
- the data arrives continuously and needs to be processed as it arrives, rather than in one batch;
- the work benefits from being spread across multiple processes, cores, or machines.
The downside of structuring a data processing job as a streaming or event-based system is complexity, which leads to longer development time. A streaming system is not so bad, since streaming is a common interface provided by many drivers and libraries for external sources and sinks of data. An event-based system, especially one that uses parallelism, adds complexity in structuring the code and orchestrating the processes. It can be difficult to monitor the execution of the job, to debug the code when errors happen, and to produce counts and statistics on the kinds of records being processed.
Not every data processing job requires this complexity. This is why Excel can be used for so many data analysis tasks: Excel reads data into memory and operates upon it there. So one of the first questions to ask, when approaching a new data processing job, is: can the data all fit into the memory of the computer that will be running the job? If so, the easiest way to write the job is usually just to read all the data into memory, operate upon it, then send the results on to the destination. Chunking might still be needed to read the data into memory or to write the results out, but the processing itself happens on the full dataset at once.
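Here is a minimal sketch of that shape in Python, assuming a hypothetical people.csv with a country column that is small enough to fit in RAM; the aggregation is only an illustration:

```python
import csv
from collections import Counter

def run_job(in_path, out_path):
    # Read the entire dataset into memory in one pass.
    with open(in_path, newline="") as src:
        rows = list(csv.DictReader(src))

    # Operate on the whole dataset at once, e.g. count people per country.
    counts = Counter(row["country"] for row in rows)

    # Send the results on to their destination.
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["country", "count"])
        for country, count in counts.most_common():
            writer.writerow([country, count])

run_job("people.csv", "people_counts.csv")
```

All the interesting work happens in the middle step, on an ordinary in-memory list, which is what keeps this kind of job easy to write, debug, and instrument.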
Operating on large amounts of data, once it is all loaded into memory, tends to be fast, limited only by the raw performance of the computer and the efficiency of the code processing the data. Perhaps more important is the development time needed to write the job itself, which is where this design pattern excels. This type of job also tends to be easy to debug, and it is easy to produce aggregate statistics that monitor how the job performed.
Go ahead, take advantage of the gigabytes of RAM available on your computer: if all your data fits in memory, write your job to load the data, process it, then store the results.