Play with the data

posted: July 18, 2020

tl;dr: Spend time playing with the data to see what unexpected insights can be gleaned from it...

I’ve already written about the first step in the implementation phase of any data-oriented task: look at the data. Data is always messy, and a fair bit of time will need to be spent understanding it, uncovering the errors and quirks, and cleaning it up. Then the fun can begin! But don’t just try to satisfy the request that has been made: be sure to spend some time playing with the data, to see what additional insights it might yield.

A client, who could be someone inside or outside the company, typically asks me to perform an analysis, develop a model, or make predictions. The client may have some idea of what they want, but often it will be vague, for example: “give me a report showing which of our promotions last quarter were most effective”. I’m being asked because I have a better command of data analysis tools than my client does. While performing the analysis I will also be closer to the actual raw data than the client is.

It definitely helps to understand the client’s business before embarking upon an analysis. Some of the questions that I try to explore with the client are:

During and after cleaning up the data, I like to play with it. I’m not only looking for ways to answer the primary request that the client has made, but also for unexpected insights that might cause the client to understand their business in a whole new way. The analyst is the person closest to the actual data, so if the analyst doesn’t bother to look deeply into it, there’s a good chance that no one will.

Some of the aspects that I analyze are:

Those last two items can yield some especially rich insights. To cite just one example of an unexpected correlation, I’ve seen major differences in user behavior based upon which email service provider or search engine they are using. This makes a certain amount of sense when you realize that the email service provider or search engine might be a proxy for the user’s age, how technically sophisticated the user is, or other aspects of the user’s personality. Findings like this can guide future decisions about how best to reach the target audience.
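As a rough illustration of how one might check for this kind of correlation, here is a minimal Python sketch that groups users by the domain of their email address and compares a behavior rate across the groups. Everything in it is an assumption made for the example: the function name conversion_rate_by_domain, the "converted" flag as the behavior of interest, and the sample rows, which are made up and not drawn from any real data set.

```python
from collections import defaultdict


def conversion_rate_by_domain(users):
    """Return {email_domain: fraction of users in that domain who converted}.

    `users` is an iterable of (email, did_convert) pairs. Both the shape of
    the input and the notion of "converting" are assumptions for this sketch.
    """
    totals = defaultdict(int)
    converted = defaultdict(int)
    for email, did_convert in users:
        # Take everything after the last "@" and normalize the case.
        domain = email.rsplit("@", 1)[-1].lower()
        totals[domain] += 1
        if did_convert:
            converted[domain] += 1
    return {domain: converted[domain] / totals[domain] for domain in totals}


# Made-up sample rows: (email address, whether the user converted).
sample_users = [
    ("ann@gmail.com", True),
    ("bob@gmail.com", True),
    ("cat@aol.com", False),
    ("dan@aol.com", True),
    ("eve@AOL.com", False),
    ("fay@gmail.com", False),
]

rates = conversion_rate_by_domain(sample_users)
for domain, rate in sorted(rates.items()):
    print(f"{domain}: {rate:.0%} of users converted")
```

With real data the groups would need to be large enough for a difference in rates to mean anything, so a sketch like this is a starting point for exploration, not a finding in itself.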

Outliers, whether individual data points or small clusters of data points, can easily yield unexpected insights. Or they can result from data errors that point to improvements that should be made in subsequent data collections. Or, in the website world that I live in, they might mean the site has been hacked. When I find some outliers, I like to dive in and attempt to determine what event or sequence of events produced them.
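One simple way to surface candidate outliers for that kind of investigation is sketched below in Python. It uses the modified z-score, which is based on the median and the median absolute deviation, with the commonly used cutoff of 3.5. This is just one technique among many and is not necessarily the one used for the analyses described here; the function name flag_outliers and the daily_signups numbers are invented for the example.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. Because it relies on medians, a single extreme
    value does not distort the baseline it is measured against.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up daily signup counts with one suspicious spike.
daily_signups = [102, 98, 110, 95, 105, 99, 970, 101]

suspects = flag_outliers(daily_signups)
print(suspects)
```

A flagged point is only a prompt to go and look: the code cannot say whether the spike is an insight, a data collection error, or a sign of something worse.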

Cleaning up the data is a necessity for performing an accurate data analysis. Playing with the data is where the fun really comes in. I like to see if I can not only give the client what they asked for, but also something that they didn’t.