Chunking data

posted: August 7, 2021

tl;dr: Some tips for speeding up data processing jobs by chunking data...

When processing data that consists of a large number of records, handling the records individually can often be too slow. Usually the records have to be obtained from some external source, such as a database, an API, or the file system. Each query, API request, or file read introduces an uncontrollable delay before the external system responds with the record. Once the record has been processed in local memory and a result is produced, such as a transformed record, that result often needs to be sent to another destination, which is usually (again) a database, an API, or the file system. Writing the result to the destination introduces a second uncontrollable delay before the external system confirms that the record has been written.

Processing data in memory is fast, even in interpreted languages such as Python and JavaScript. So the external system delays in the inbound and outbound directions can easily dominate the amount of wall clock time that it takes for a data processing job to run. It’s not unusual to have a job that spends less than 1% of its time processing the data in memory, and 99% of the time waiting for requested data to show up and for written data to be confirmed. A runtime profiling tool can provide a more precise figure. If the total wall clock time needed to run the entire job and process all the records exceeds the upper bound of what users will accept, then optimizations are needed.
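
Here is a minimal Python sketch of how you might measure that split yourself. The fetch_records and write_results functions are stand-ins for your real external reads and writes; the sleeps just simulate their delays:

import time

def fetch_records():
    time.sleep(0.2)                      # stand-in for a slow external read
    return list(range(1000))

def write_results(results):
    time.sleep(0.2)                      # stand-in for a slow external write

start = time.perf_counter()
records = fetch_records()
read_time = time.perf_counter() - start

start = time.perf_counter()
results = [r * 2 for r in records]       # the actual in-memory processing
process_time = time.perf_counter() - start

start = time.perf_counter()
write_results(results)
write_time = time.perf_counter() - start

total = read_time + process_time + write_time
print(f"waiting on I/O: {100 * (read_time + write_time) / total:.0f}%")
print(f"processing in memory: {100 * process_time / total:.0f}%")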

Chunking the data can dramatically speed up the job. I don’t think you’ll find the term “chunk” in too many computer science textbooks, but it is a common software industry term. A “chunk” is just a collection or group of records of some fixed size, such as 10 or 100 records. When reading from or writing to an external system, a large amount of the delay is independent of the number of records in transit. It might take a database 100 milliseconds to respond with one record, and 110 milliseconds to respond with 10 records, and 200 milliseconds to respond with 100 records. With these values, a job that needs to process 1000 records would require 100 seconds to read them 1 at a time, or 11 seconds to read them 10 at a time, or 2 seconds to read them 100 at a time.

Chunking, however, introduces some additional code complexity and perhaps some unreliability. Usually, external systems do not tell a programmer the ideal amount of data to put into a chunk. The larger the chunks are, the more burden is placed on the external system, and the system may fail or hit some internal performance limitations when the chunk size is very large. Also, it may be harder to diagnose failed jobs. When a job processes one record at a time and it fails by throwing an exception, it’s easy to see exactly which record caused the exception. But when a job processes chunks of records, and fails because of a problem with one record in the chunk, it’s often hard to determine which record caused the exception. It can also be harder to rerun the job by skipping over all the records that have been successfully processed, because some of the records in the chunk may have succeeded whereas others may have failed.

Chunking records that are already in memory is pretty easy:

Python:

# records is a list
CHUNK_SIZE = 10
for i in range(0, len(records), CHUNK_SIZE):
    chunk = records[i:i+CHUNK_SIZE]
    # Now process the chunk

JavaScript:

// records is an array
const CHUNK_SIZE = 10;
for (let i = 0; i < records.length; i += CHUNK_SIZE) {
  const chunk = records.slice(i, i + CHUNK_SIZE);
  // Now process the chunk
}

Using chunks with external systems is a bit harder. Databases can deal with chunks through modified queries (bulk inserts, for example) and through cursors that advance through the records returned by a query. External APIs need some way to accept and return bulk records. If they don't, and you can't do asynchronous processing, then there may be no way to speed up the flow of data to and from a particular external API.
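
As a sketch of the database case, here is what chunked reads and writes might look like with Python's built-in sqlite3 module, which follows the standard DB-API cursor interface. The database files, table names, and columns are made up for illustration:

import sqlite3

CHUNK_SIZE = 100

src = sqlite3.connect("source.db")       # hypothetical source database
dst = sqlite3.connect("destination.db")  # hypothetical destination database

cursor = src.execute("SELECT id, payload FROM records")
while True:
    # fetchmany() advances the cursor and pulls up to CHUNK_SIZE rows per round trip
    chunk = cursor.fetchmany(CHUNK_SIZE)
    if not chunk:
        break
    # Transform the chunk in memory (here, a trivial uppercase transformation)
    transformed = [(row_id, payload.upper()) for row_id, payload in chunk]
    # executemany() writes the whole chunk back in one statement
    dst.executemany(
        "INSERT INTO results (id, payload) VALUES (?, ?)", transformed
    )
    dst.commit()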

The challenge common to all programs that chunk data is determining the best chunk size to use. Here are my tips:

Start off by not chunking

“Premature optimization is the root of all evil,” as computer scientist Donald Knuth is fond of saying. In other words, write your code in as simple a manner as possible, which means no chunking, and then run it against real world data sets. If and only if performance is unacceptable should you refactor the code to introduce chunking.
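
In its simplest form, before any chunking, the job is just a loop over individual records. This is only a sketch; record_ids, read_record, process, and write_result are hypothetical stand-ins for your real data and I/O:

# Hypothetical stand-ins for the real reads, writes, and processing.
for record_id in record_ids:
    record = read_record(record_id)    # one round trip per record
    result = process(record)           # fast, in-memory work
    write_result(result)               # another round trip per record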

Make the chunk size easy to change

Knowing that you’ll want to test the performance of various chunk sizes, you should make the chunk size easy to change between runs. One way to do this is to make CHUNK_SIZE a constant global variable. You may need a few such variables if you want to use different chunk sizes for different external systems.
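
One step further, purely as a sketch, is to let an environment variable override the default so the chunk size can change between runs without editing the code. DB_CHUNK_SIZE and API_CHUNK_SIZE are made-up names:

import os

# Defaults, overridable per run, e.g. DB_CHUNK_SIZE=500 python job.py
DB_CHUNK_SIZE = int(os.environ.get("DB_CHUNK_SIZE", "100"))
API_CHUNK_SIZE = int(os.environ.get("API_CHUNK_SIZE", "25"))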

Start with a chunk size of 1

After the code has been refactored to chunk data and perform bulk operations against the external systems, try a chunk size of one. This ensures that the chunking handles an important boundary case: even when you have millions of records to chop up into chunks, the last chunk will occasionally contain a single record, so the code always needs to handle one-record chunks, and running with a chunk size of 1 exercises that case on every chunk. We programmers are infamous for off-by-one errors, so this boundary deserves a test. With a chunk size of 1, the program will probably run slightly slower than it did before chunking was added, when each record was always processed individually.
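
Here is a quick sketch of what that boundary check might look like, using the slicing loop from earlier wrapped in a hypothetical chunked() helper:

def chunked(records, chunk_size):
    """Yield successive chunks of at most chunk_size records."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

records = list(range(7))

# Chunk size of 1: every chunk holds exactly one record.
assert list(chunked(records, 1)) == [[0], [1], [2], [3], [4], [5], [6]]

# Larger chunk size: the last chunk holds a single leftover record.
assert list(chunked(records, 3)) == [[0, 1, 2], [3, 4, 5], [6]]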

Keep increasing chunk size by an order of magnitude, then back off

After a successful run with a chunk size of 1, try 10, then 100, then 1000, and so on. You should see your program speed up considerably. Eventually, with a large enough chunk size, the program will likely fail: the external system may have been given too large a chunk of data, or the data may be too large to transmit reliably between your program and the external system. Once you know the chunk size at which the program fails, back off by 50% and see if that size runs reliably. Fine-tune the chunk size until the program runs both quickly and reliably. Be prepared to adjust CHUNK_SIZE again in the future, in production, based on variations in the data and the external systems over time.
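
If you want to be a little more systematic about it, you can time a fixed workload at each candidate chunk size. This is only a sketch: run_job and load_test_records are hypothetical stand-ins for your real chunked job and a representative data set:

import time

def run_job(records, chunk_size):
    # Hypothetical: process records in chunks of chunk_size
    # against the real external systems.
    ...

records = load_test_records()  # hypothetical fixed, representative data set

for chunk_size in (1, 10, 100, 1000, 10000):
    start = time.perf_counter()
    try:
        run_job(records, chunk_size)
    except Exception as exc:
        print(f"chunk size {chunk_size}: failed ({exc})")
        break
    elapsed = time.perf_counter() - start
    print(f"chunk size {chunk_size}: {elapsed:.1f}s")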

If this sounds like turning a knob until the system works, instead of calculating the best position for the knob based on scientific principles, that’s because it is. Experience can help in this process.