The trouble with generators

posted: February 8, 2020

tl;dr: Generators are great, if used judiciously...

Now don’t get me wrong: I like generators. Like most programming concepts, they are a useful tool in certain situations.

One of those situations arose this week: I was loading some records representing people into a cloud service, and had to give each record a tag to put half the records/people into ‘segment_a’ and half into ‘segment_b’. A generator is an elegant solution for situations like this, where a long sequence of values needs to be created, and the next value in the sequence depends on the previous value(s). I wrote perhaps the simplest generator I’ve ever written.

Here it is in Python, along with a few print statements that demonstrate how it is used:

def segment_gen():
    while True:
        yield 'segment_a'
        yield 'segment_b'

segment = segment_gen()
print('First person is in', next(segment))
print('Second person is in', next(segment))
print('Third person is in', next(segment))

And here it is in TypeScript:

function* segmentGen(): Generator<string> {
  while (true) {
    yield 'segment_a';
    yield 'segment_b';
  }
}

const segment = segmentGen();
console.log('First person is in', segment.next().value);
console.log('Second person is in', segment.next().value);
console.log('Third person is in', segment.next().value);

I could have used a closure, but I find a generator, with its ‘yield’ keyword, to be simpler and more explicit. As long as the person reading the code knows what ‘yield’ does (it returns a value and pauses execution at that point, then resumes after that point the next time the generator is invoked), it is easier to understand than a closure. I find that people often have trouble wrapping their brains around what is happening inside a closure.
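For comparison, a closure-based version might look something like this in Python (my own sketch, not code from the original job):

def make_segment_picker():
    # mutable state that the inner function closes over
    count = 0
    def next_segment():
        nonlocal count
        count += 1
        return 'segment_a' if count % 2 == 1 else 'segment_b'
    return next_segment

segment = make_segment_picker()
print('First person is in', segment())
print('Second person is in', segment())
print('Third person is in', segment())

It works, but you have to reason about the ‘nonlocal’ state being mutated across calls, which is exactly the part people find hard to follow.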

The main situation where generators are invaluable is when processing large amounts of data that won’t all fit in the computer’s memory at once. A generator can be used to grab individual records or chunks of records from the source. Those records can then be processed and written to the ultimate destination before the next records are accessed. This effectively creates a stream of records, which can be transformed via a pipeline of generators. The burden on the computer’s memory while the code is running is just the memory space needed to store the chunk of records being processed at any point in time, not the entire data set.
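As a rough sketch of that pattern (the file name and record layout here are invented for illustration), a pipeline of generators might look like this:

def read_lines(path):
    # yield one line at a time instead of reading the whole file into memory
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

def parse(lines):
    # turn each CSV-ish line into a list of fields
    for line in lines:
        yield line.split(',')

def keep_complete(rows):
    # drop rows that are missing fields
    for row in rows:
        if all(field.strip() for field in row):
            yield row

# each stage pulls one record at a time from the stage before it,
# so only a single record needs to be in memory at any moment
pipeline = keep_complete(parse(read_lines('people.csv')))
for row in pipeline:
    print(row)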

But processing records in this manner will likely demonstrate the main downside of generators: they are hard to use (consume) and brittle.

At my last company, a developer who was learning to program read about generators and fell in love. He had the task of adding some features to code which loaded a few thousand records and did a series of transformations before writing them to the final destination. He decided to migrate the job to use a pipeline of generators, instead of the prior structure of reading all the records into a list/array and processing them while in memory.

I pointed out that this code was already working and it would only ever have to process a few thousand records, which maybe consumed a few megabytes of memory. His local machine had gigabytes of memory, and when the code was pushed to the cloud, it would run on a server with even more gigabytes of memory. Generators weren’t needed.

He rewrote the code to use generators anyway. After the job was deployed to the cloud, I noticed that the record counts didn’t look right. Whatever the number of records in the original source (a Google Sheet), the job appeared to load one fewer. It wasn’t clear if a record was getting skipped somehow, or if the counts were off by one.

I initially focused on the main loop that processed the records, inserting debug print statements and repeatedly running the job. It was a ‘for’ loop that consumed the generator and ran the records through the pipeline. It was hard to inspect the intermediate state of the generator pipeline to see whether a record was being skipped, because the entire record set never lived in memory. There were situations where records were skipped if certain data wasn’t present, but I still couldn’t find the source of the problem.

After trying to figure out what was happening inside the pipeline, I finally turned my attention elsewhere. Well before the ‘for’ loop, I found a single ‘next’ call that grabbed the first element of the generator in order to look at some of its fields and set up some other configuration for the job. The developer had forgotten that calling ‘next’ starts consuming the generator, so the ‘for’ loop was starting on the second record in the data set. The first record was always being skipped.
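In simplified form (this is a reconstruction, not the actual job code), the bug looked something like this:

def records():
    yield {'name': 'Ada', 'email': 'ada@example.com'}
    yield {'name': 'Grace', 'email': 'grace@example.com'}
    yield {'name': 'Linus', 'email': 'linus@example.com'}

recs = records()

# peek at the first record to set up some configuration...
first = next(recs)
columns = list(first.keys())

# ...but that 'next' already consumed the first record, so this loop
# starts at the second one: 'Ada' is silently skipped
for rec in recs:
    print('processing', rec['name'])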

It took hours to find this problem, and days had already elapsed with slightly erroneous data being loaded by the job.

The pipeline of generators was also brittle. Because the job was consuming data entered by human users, it needed to be robust enough to handle all sorts of crazy input. It wasn’t robust, and when unexpected data was encountered, the job would throw an exception somewhere in the pipeline of generators, producing a big stack trace and leaving very few other clues about what had gone wrong. It wasn’t possible to simply print out parts of the in-memory set of records while the job ran in the cloud, because the entire data set never resided in memory at once.
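The failure mode looks roughly like this (a toy example, not the real pipeline):

def parse_ages(rows):
    for row in rows:
        yield int(row['age'])  # blows up on messy human input

def double(ages):
    for age in ages:
        yield age * 2

rows = iter([{'age': '34'}, {'age': 'unknown'}, {'age': '29'}])

for value in double(parse_ages(rows)):
    # prints 68, then raises ValueError from deep inside the pipeline;
    # the records already consumed are gone, so there is no in-memory
    # list left to inspect when the job falls over
    print(value)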

Using generators for this job was a classic case of premature optimization, or completely unneeded optimization, because the number of records would never expand to the point of not being able to fit in memory. The effort put into transforming the job to use generators was wasted, and caused additional problems that took hours, over the course of days, to debug.

Lists and arrays are great data structures. Once the data is in memory you can iterate over it multiple times, forwards and backwards and by various step sizes; you can slice it; you can access any element you need, the first or last or any in between; you can sort it in place; and you can transform it into new lists via a variety of methods, including map, filter, and list comprehensions. Generators are also great, but you can only consume them once, from start to finish. Generators should be used judiciously.
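A quick illustration of that last point:

# a list can be traversed as many times as you like, in any order
people = ['Ada', 'Grace', 'Linus']
print(people[0], people[-1])
print(list(reversed(people)))

# a generator is exhausted after a single pass
segments = (s for s in ['segment_a', 'segment_b'])
print(list(segments))  # ['segment_a', 'segment_b']
print(list(segments))  # [] -- nothing left the second time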