Write the bare minimum amount of SQL

posted: October 12, 2019

tl;dr: You’re probably better off writing data processing code in a higher-level language than SQL...

SQL databases are ubiquitous and they aren’t going away anytime soon, regardless of the growing number of non-SQL alternative database technologies such as document/object stores, key-value stores, and graph databases. One of the benefits of using a SQL database is the abundance of drivers available that allow a SQL database to be programmatically queried in another language used for application development, such as Python, JavaScript, Java, and many others. The application developer is then faced with a choice: how much of the data processing functionality should be implemented in the SQL layer, and how much should be done within the application layer?

In some cases the answer is dictated by performance concerns. If the SQL database server has lots of CPU capacity, and the application CPU is a low-end smartphone, and the bandwidth between the application and the database server is limited (such as in a cellular data network), then it absolutely makes sense to do as much of the data processing as possible in SQL on the database server, transferring only the minimal amount of resulting data to the application. But if the application is running on a server itself, and there is a high bandwidth network connection between the database and application server, as is the case when they are co-located within the same cloud-based data center, then I think it is best to do as much of the data processing as possible in the application layer, using a more powerful language than SQL. For the rest of this post I’ll assume that more powerful language is Python, because Python is widely used in the data science realm.

SQL experts may feel differently. Certainly, as Lukas Eder's post 10 SQL Tricks That You Didn't Think Were Possible illustrates, you can do some very sophisticated operations in SQL, such as recursion. Recursive queries also make SQL Turing complete, so in principle you can perform any computation in SQL. However, unless you are a SQL expert, the examples in that post will probably make your head spin.

I prefer to do as little in SQL as possible, and as much of my data manipulation as possible in a higher-level language like Python. Basically, I prefer to write some simple SQL queries to bring in all the data I need, and then filter it and sort it and de-dupe it and clean it and combine it and aggregate it and format it in Python. Here’s why:

1. The Python data science stack

If you are ultimately going to be doing some very sophisticated mathematical calculations on the data, or training a machine learning model, there’s a good chance that you’re going to want to use the Python data science stack. That data science stack includes pandas, which is widely used to clean up and filter tabular data (for a good introduction to pandas, see Wes McKinney’s book Python for Data Analysis). So if you’re going to be using Python anyway to analyze data that resides in a SQL database, the best approach is to bring the data into Python as quickly as possible and work with it in Python.
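As a minimal sketch of that "simple query, then Python" pattern, here is an example using the standard library's sqlite3 module with an in-memory database standing in for a real SQL server (the table name and columns are invented for the example):

```python
import sqlite3

# A small in-memory SQLite database standing in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, 'ashley@gmail.com'), (2, 'bob@yahoo.com')])

# One simple SELECT pulls everything into Python...
rows = conn.execute("SELECT id, email FROM users").fetchall()

# ...and from here the data is ordinary Python objects.
users = [{'id': r[0], 'email': r[1]} for r in rows]
```

From here you could hand the rows straight to pandas (for example via pandas.DataFrame, or query directly with pandas.read_sql) and do the heavy lifting there.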

2. Not everything fits nicely into a table

Item one of Lukas Eder's post states that, in SQL, “Everything is a table!”. This is certainly true, but I view it as a major handicap of SQL. There are other data structures out there, especially sets and hashmaps (Python dictionaries or JSON objects), as well as generators, that simplify the solution to many common problems. Sets make de-duping a dream, hashmaps make lookups easy by effectively creating indexes on-the-fly, and generators can be used to construct data pipelines that allow multi-stage data processing without building large interim tables/arrays/lists in memory. Then there are the data structures provided by numpy (arrays) and pandas (dataframes), which come with a wealth of built-in methods to do very powerful data cleansing and advanced mathematical calculations. By getting your data out of the SQL world and into a more powerful language like Python, you can easily do complex tasks in just a few lines of code.
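As a small illustration of how far those built-in structures go, here is a sketch (the record layout is invented for the example) that de-dupes with a set, builds an on-the-fly index with a dict, and chains two generators into a lazy pipeline:

```python
records = [('ashley', 'gmail'), ('bob', 'yahoo'), ('ashley', 'gmail')]

# A set de-dupes in one expression.
unique = set(records)

# A dict comprehension acts as an on-the-fly index: name -> provider.
by_name = {name: provider for name, provider in records}

# Generators form a lazy pipeline; no intermediate list is built
# until the final list() call.
names = (name for name, _ in records)
upper = (n.upper() for n in names)
result = list(upper)
```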

3. Simpler syntax

I probably put more weight on simplicity than most, and higher-level languages like Python have a simpler syntax than SQL. Simpler syntax is more likely to be understood by others, and by you, when you come back later to read or debug your own code. Here are a couple of quick examples:

Filtering is a common task, and it can be done in one easy-to-understand line in Python. Let’s build a list of dictionaries to store records with two fields, a user id and the user’s email address:

users = [{'id': 1, 'email':'ashley@gmail.com'}, {'id': 2, 'email':'bob@yahoo.com'}, {'id': 3, 'email':'cindy@outlook.com'}]

Filtering that data to produce a new data structure that removes all users with a Gmail address is trivial:

users_non_gmail = [u for u in users if not u['email'].endswith('@gmail.com')]

If you want to build a generator, to avoid building an entire new list in memory, it’s simply:

users_non_gmail = (u for u in users if not u['email'].endswith('@gmail.com'))

In SQL, assuming the same data resides in a users table, it’s a bit more cumbersome:

WITH users_non_gmail AS (
    SELECT *
    FROM users
    WHERE email NOT LIKE '%@gmail.com'
)
SELECT * FROM users_non_gmail;

and, depending on the database's query planner, the CTE will typically cause the database to build another entire result set, users_non_gmail, in memory.

De-duping data is a major database challenge. I’ve worked with many different client SQL databases over the years, and almost every database I’ve worked with has had duplicate records that need to be removed. This can certainly be done in SQL, but the prevalence of SQL databases with duplicate records indicates that not many people know how to do this. In Python, thanks to the set data structure, it’s pretty easy to go through a dataset with duplicate records and construct two datasets, one with the first occurrences of each record and a second with the duplicate records.

Starting with a data structure that has four user records, two of which have the same email address:

users = [{'id': 1, 'email': 'ashley@gmail.com'}, {'id': 2, 'email': 'bob@yahoo.com'}, {'id': 3, 'email': 'cindy@outlook.com'}, {'id': 4, 'email': 'ashley@gmail.com'}]

you can use a Python set to help de-dupe the records:

user_emails = set()  # unique email addresses seen so far
users_dedupe = []    # first occurrence of each email
users_dupe = []      # subsequent (duplicate) occurrences
for u in users:
    if u['email'] not in user_emails:
        users_dedupe.append(u)
        user_emails.add(u['email'])
    else:
        users_dupe.append(u)

After running this code, users_dedupe will contain the de-duplicated user records, users_dupe will have the duplicate records that need to be removed from the database, and user_emails will have all the unique user email addresses.

Yes, there are ways to solve the duplicate record problem in SQL by setting up schemas and enforcing unique values. But the bottom line is that I’ve seen too many SQL database problems in my years to believe that programmers know enough SQL to prevent these problems from cropping up in the first place. I blame a good portion of that on the challenges of SQL syntax.
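For completeness, here is a sketch of that schema-level approach, again using the standard library's sqlite3 (table and column names are invented for the example): a UNIQUE constraint makes the database itself reject duplicate emails at insert time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint makes the database reject duplicate emails outright.
conn.execute("CREATE TABLE users (id INTEGER, email TEXT UNIQUE)")

rows = [(1, 'ashley@gmail.com'), (2, 'bob@yahoo.com'), (3, 'ashley@gmail.com')]
inserted, rejected = 0, 0
for row in rows:
    try:
        conn.execute("INSERT INTO users VALUES (?, ?)", row)
        inserted += 1
    except sqlite3.IntegrityError:
        # The duplicate email violates the UNIQUE constraint.
        rejected += 1
```

The constraint prevents duplicates from ever entering the table, but it only helps if it was in the schema from the start, which is exactly the discipline this section argues is often missing in practice.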

4. More people know Python (and other languages) than advanced SQL

Python is being taught at the university level, in high schools, and even in grade schools these days. SQL, because it is not a general purpose programming language, isn’t taught as much in educational settings. Yes, it is an important skill for many software developers to have, and it is often learned on-the-job. SQL experts like Lukas Eder are the exception rather than the rule: many more programmers prefer to work in their favorite high level language. Code ultimately needs to be understood by more people than just the author, which is another reason to put the data processing functionality in a high level language.

Related post: De-duplicating database records