posted: June 26, 2021
tl;dr: A fun test that quite often surfaces character encoding issues across a software system...
All too often, we in the software profession focus our time and attention on normal, regular, well-structured data that meets our own biased expectations. We may get a software system working to our satisfaction, only to be surprised later when it is deployed to production and strange results happen. Users, who do not share our preconceptions, will enter into the system whatever data they want, whether intentionally or by mistake. That user data may surface issues that existed all along, but which escaped our quality control efforts because we didn't think broadly enough about what data might enter into our system.
Character encoding issues are often noticed this way, in production. Character encoding tends to be glossed over by us software developers, or we make assumptions that are not true. Often we forget that characters are not atomic: they are made of smaller particles, bits and bytes. There is no such thing as a universally recognized character; there are just bits and bytes. The character we interpret from the actual bits and bytes depends entirely on the character encoding scheme. Many, many character encoding schemes have been designed over the decades, a good number of which remain in use in various systems.
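To make that concrete, here is a small Python illustration: the very same bytes come out as two different strings depending on which encoding the reader assumes.

```python
# The same four bytes, read under two different character encodings.
raw = b"Jos\xc3\xa9"          # the bytes a UTF-8 encoder produces for "José"

print(raw.decode("utf-8"))    # José  -- the intended interpretation
print(raw.decode("latin-1"))  # JosÃ© -- the same bytes, misread as Latin-1
```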
Characters have to be encoded everywhere they are processed and stored: in memory, on disk in files, when traversing a network, in a web page sent to a browser, in an email sent between users in different countries. This creates plenty of opportunities for different character encoding schemes to be implemented in different systems that all need to speak to each other. Problems occur when one particular sequence of bits and bytes is interpreted to be a certain character on one system but a different character on another.
Those of us in the United States have had it easy from the early days of computers, because the alphabet we use (26 upper and lower case letters, without accents or other diacritics) was fully contained within the earliest widespread character encoding schemes, especially the original 7-bit ASCII character set. Thus we could confidently type a name such as "Chris" and have it encoded in a sequence of 7-bit numbers, each of which fits into a byte. But what about José? What about Elon Musk's child, X Æ A-12 Musk?
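A quick Python sketch makes the limitation visible: ASCII has room for "Chris" but simply refuses anything outside its 128 characters.

```python
# ASCII happily encodes an unaccented name...
print("Chris".encode("ascii"))      # b'Chris' -- every letter fits in 7 bits

# ...but it has no byte for é.
try:
    "José".encode("ascii")
except UnicodeEncodeError as err:
    print(err)                      # 'ascii' codec can't encode character '\xe9' in position 3 ...
```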
That's why additional character sets were designed, to extend the range of characters that can be encoded. This extension is still going on today, especially in the realm of newly standardized emojis. There are now many thousands of characters that have been standardized, with several ways of turning them into sequences of bytes. But when these new characters traverse older legacy systems, which still operate with older character sets, bad things can happen: weird-looking characters, strings that get mangled or cut off, or even failure of the system to process a record, form, or file provided as input.
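For example, a single standardized emoji becomes quite different byte sequences depending on the encoding form, and a legacy encoding may have no representation for it at all. A small Python illustration:

```python
rocket = "\U0001F680"                              # ROCKET emoji, one code point: U+1F680

print(rocket.encode("utf-8"))                      # b'\xf0\x9f\x9a\x80' -- four bytes
print(rocket.encode("utf-16-le"))                  # b'=\xd8\x80\xde'    -- a surrogate pair
print(rocket.encode("latin-1", errors="replace"))  # b'?'                -- no slot in a legacy charset
```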
To surface character encoding issues before a system goes into production, I like to do a fun, simple test that I call the emoji test: sprinkle some emojis in among the characters in the input to the system, then look carefully at all destinations downstream to see what happened to those characters. Did records with emoji characters get written to the database? Did the emojis make their way into all the third-party APIs that connect to the system? Did they get reflected in the user output of the system, such as an email sent to the user? Do they appear when a record is retrieved and viewed by the user in the future? Or did something go awry along the way, perhaps with some internal warnings or exceptions?
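In code form, the emoji test is little more than a round trip. Here is a rough sketch in Python; `save_user` and `load_user` are hypothetical stand-ins for whatever persistence layer, API, or email path your system actually has.

```python
def emoji_round_trip_test():
    # Sprinkle emojis (and an accented letter) into otherwise ordinary input.
    test_name = "José 🚀😀"

    user_id = save_user(name=test_name)       # hypothetical: write through the normal code path
    stored_name = load_user(user_id).name     # hypothetical: read it back from the database/API

    assert stored_name == test_name, (
        f"expected {test_name!r}, got {stored_name!r}: "
        "something downstream is mangling the encoding"
    )
```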
It is often not possible to make every aspect of the system handle the emojis properly. There just might be some incompatible character sets deep within the system. Sometimes it is possible to adjust the character set that a subsystem uses, which may solve a particular problem. Other times, however, it can be hard to even determine what character set a subsystem is using, which can lead to a trial-and-error approach. But if you are able to improve the ability of your system to handle emojis, you almost certainly are making life better for your users named José.
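When trial and error is the only option, one crude but useful trick is to take bytes the subsystem produced for known input and see which decodings reproduce the expected text. A sketch follows; the candidate list is just an assumption about which legacy charsets are plausible in your environment.

```python
def guess_encoding(raw_bytes, expected_text,
                   candidates=("utf-8", "cp1252", "latin-1", "utf-16")):
    """Return the candidate encodings under which raw_bytes decodes to expected_text."""
    matches = []
    for name in candidates:
        try:
            if raw_bytes.decode(name) == expected_text:
                matches.append(name)
        except UnicodeDecodeError:
            pass
    return matches

print(guess_encoding(b"Jos\xe9", "José"))   # ['cp1252', 'latin-1'] -- narrowed down, not proven
```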
An aside: when a regex is used to filter out undesired input, it can sometimes make life difficult for users of alphabets with accented characters, even if the underlying character encoding scheme supports those characters. One common regex pattern, intended to restrict the user to alphabetic characters, is [A-Za-z]. The problem with this pattern is that José won't be able to enter his name properly, whereas Chris will. A more sophisticated solution is needed.
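In Python, for example, one way to loosen the pattern is to lean on Unicode-aware character classes rather than ASCII ranges; a small sketch:

```python
import re

# The ASCII-only pattern rejects perfectly valid names.
print(re.fullmatch(r"[A-Za-z]+", "Chris") is not None)   # True
print(re.fullmatch(r"[A-Za-z]+", "José") is not None)    # False -- é is outside A-Za-z

# Unicode-aware alternatives accept any letter, accented or not.
print("José".isalpha())                                   # True
print(re.fullmatch(r"[^\W\d_]+", "José") is not None)     # True -- "word chars minus digits/underscore"
```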