Avoiding regex hell

posted: May 25, 2019

tl;dr: Use string methods, if you can, to avoid writing, reading, testing, and debugging regex...

One of the fun aspects of teaching a Python 101 class at meltmedia is that, while covering various features of the language, I am reminded anew why I like Python. One small reason why is the richness of Python’s built-in string methods, which can often be used to avoid having to write regex.

Python also, of course, supports regex via the built-in re module. But I think it’s fair to say that even the best programmers have trouble remembering the detailed syntax of regex. Most programmers, myself included, implement regex by using cheat sheets, online documentation (such as Python’s, which explains the syntax pretty well), or regex websites such as regex101.com and regexr.com.

Even Python's inventor, Guido van Rossum, needs to look up regex details

The reason regex is difficult is that it is its own language and consists almost entirely of symbols. There are no English-language keywords. The symbols have different meanings in different contexts, and those contexts are defined by other symbols. It’s a bit like programming in Perl, back in the day. Even if you manage to create a regex that does what you want it to do, it may not be understandable to you in six months, or to others who have to read your code to investigate an issue. Given the emphasis I place on code readability, regex sprinkled into other code creates a problem. As the joke goes: if you have a string problem and you solve it with regex, now you have two problems.

Here’s a simple example. The following two Python functions do the same thing: test a string to see if it is a U.S. zip code, either a 5-digit zip code or a 9-digit zip code with a hyphen between the first 5 and last 4 digits:

def is_zipcode(z):
    if z.isascii():  # Restrict to ASCII since .isdigit() allows Unicode
        if (len(z) == 5 and z.isdigit()) or (len(z) == 10 and
                z[0:5].isdigit() and z[5] == '-' and z[6:10].isdigit()):
            return True
    return False

import re
def is_zipcode_re(z):
    # 5 digits, optionally followed by a hyphen and 4 more digits
    if re.fullmatch(r'[0-9]{5}(?:-[0-9]{4})?', z):  # must match the whole string
        return True
    return False

The is_zipcode() function is, I submit, more readable. It can be understood by new Python programmers and probably even other programmers who aren’t familiar with Python. It’s pretty easy to determine by code inspection that it is correct. The regex in the is_zipcode_re() function isn’t too bad as far as regex goes, but it will require some understanding of regex. Some time will need to be spent making sure all the symbols are correct and in the proper order if attempting to verify it by inspection.
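A quick way to back up inspection is to run both approaches over a handful of inputs. The following is a minimal sketch: the helper names zip_methods(), zip_search(), and zip_fullmatch() and the sample strings are made up for illustration. It also exposes one of those easy-to-miss regex details: in Python, $ also matches just before a trailing newline, so an anchored pattern run through re.search() accepts '85001\n', while re.fullmatch() and the string-method version do not.

```python
import re

def zip_methods(z):
    # String-method check: 5 ASCII digits, or 5 digits, a hyphen, and 4 digits
    if not z.isascii():
        return False
    if len(z) == 5:
        return z.isdigit()
    return (len(z) == 10 and z[:5].isdigit()
            and z[5] == '-' and z[6:].isdigit())

def zip_search(z):
    # Anchored pattern with re.search(); note that $ also matches
    # just before a trailing newline
    return re.search(r'^[0-9]{5}(?:-[0-9]{4})?$', z) is not None

def zip_fullmatch(z):
    # re.fullmatch() requires the whole string to match the pattern
    return re.fullmatch(r'[0-9]{5}(?:-[0-9]{4})?', z) is not None

# Illustrative samples, including one with a trailing newline
samples = ['85001', '85001-1234', '8500', '85001-123', '85001 1234', '85001\n']

for s in samples:
    print(repr(s), zip_methods(s), zip_search(s), zip_fullmatch(s))
```

On these samples the three functions agree on every row except the last one, where only zip_search() returns True.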

Both are the same number of lines, if you count the import statement and if you insist (as I would) upon a comment to explain the algorithm the regex is implementing. I’d probably not insist upon a comment to explain the algorithm in the is_zipcode() function because you can determine it by just reading the code. Another interesting issue is the need for the .isascii() method in is_zipcode(): Python’s .isdigit() method also accepts characters outside the ASCII range that the Unicode standard classifies as digits. This is an issue for regex as well: if you are validating a name to ensure that a user entered only acceptable characters, you shouldn’t reject the name just because the user entered a non-ASCII character by, for example, typing “José” instead of “Jose”. In many cases using string methods, or a custom-defined function, to test a string for semantic meaning will be better than explicitly listing the entire allowed character set inside the regex.
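The Unicode behavior is easy to see for yourself. Here is a small sketch, with sample strings chosen purely for illustration: fullwidth digits pass .isdigit() but fail .isascii(), \d in a str pattern matches them too unless the re.ASCII flag is given, and .isalpha() accepts “José” where an ASCII-only character class does not.

```python
import re

fullwidth = '\uff18\uff15\uff10\uff10\uff11'  # "85001" written with fullwidth digits

print(fullwidth.isdigit())    # True: these are Unicode decimal digits
print(fullwidth.isascii())    # False: none of them are ASCII
print('85001'.isascii() and '85001'.isdigit())  # True

# In a str pattern, \d matches Unicode decimal digits unless re.ASCII is set
print(bool(re.fullmatch(r'\d{5}', fullwidth)))            # True
print(bool(re.fullmatch(r'\d{5}', fullwidth, re.ASCII)))  # False
print(bool(re.fullmatch(r'[0-9]{5}', fullwidth)))         # False

# Testing for meaning ("is a letter") instead of listing allowed characters
name = 'Jos\u00e9'  # "José"
print(name.isalpha())                          # True
print(bool(re.fullmatch(r'[A-Za-z]+', name)))  # False
```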

Now there certainly are use cases where regex makes sense. You may be programming a system that only accepts regex to define valid string formats, or you may be programming in a language that lacks rich string methods. Regex can also perform faster than chaining multiple string methods and conditional tests, so if the application you’re writing has strict performance requirements or is going to be CPU bound by the amount of string parsing that needs to be done, regex may be the better solution.
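Performance claims like this are best checked against your actual workload, since the result depends on the pattern, the input, and the Python version. The sketch below uses the standard timeit module to compare the two approaches; the function names, the sample string, and the iteration count are arbitrary choices for illustration, and no timings are shown because they will vary from machine to machine.

```python
import re
import timeit

ZIP_RE = re.compile(r'[0-9]{5}(?:-[0-9]{4})?')  # compiled once, reused on every call

def zip_methods(z):
    # String-method check: 5 ASCII digits, or 5 digits, a hyphen, and 4 digits
    if not z.isascii():
        return False
    if len(z) == 5:
        return z.isdigit()
    return (len(z) == 10 and z[:5].isdigit()
            and z[5] == '-' and z[6:].isdigit())

def zip_regex(z):
    # Regex check using the precompiled pattern
    return ZIP_RE.fullmatch(z) is not None

sample = '85001-1234'

# Time 100,000 calls of each function on the same input
t_methods = timeit.timeit(lambda: zip_methods(sample), number=100_000)
t_regex = timeit.timeit(lambda: zip_regex(sample), number=100_000)

print(f'string methods: {t_methods:.3f}s  regex: {t_regex:.3f}s')
```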

But if you are not in those situations, and just need to do some occasional string parsing or manipulation, try using string methods. Your future self, and others who have to read and maintain your code, will thank you.