Many “analytics professionals” or “quants” I know or have worked with have no hesitation in diving straight into a statistical model when faced with a problem, rather than first trying to understand the data. That is not the way I work. Whenever I set out to solve a new problem, I start by spending considerable time getting a feel for the data. There are many things I do to “feel” the data: look at a few lines of it, look at descriptive statistics of some of the variables, and look at the distributions of individual variables. The most powerful tool for getting a feel for data, however, is the humble scatter plot.
The beauty of the scatter plot is that it allows you to get a real feel for the data. Taking variables two at a time, it shows you not only how each of them is distributed but also how they are related to each other. Relationships that are not apparent when you look at the raw numbers become apparent when you graph them. I would go so far as to say that the scatter plot can define the direction and scope of your entire solution.
The problem with the debate on how analytics should be done is that it is loaded. A large majority of people who use statistics in their daily work dive straight into analysis without looking at the data. Perhaps they consider looking at data a waste of time? I have even seen pitch decks from highly reputed software companies proposing solutions such as “we will solve this problem using Logistic Regression” without their having seen the data at all.
Let us take an example now. Take the following four data sets (my apologies for putting an image here):
Let us say you dive straight into the analysis. Like a good “analytics professional”, you run a regression on each data set. You may even compute some descriptive statistics for each of the data sets along the way. And this is what you find (again, apologies for the image):
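For readers who would rather run the numbers than read them off an image, here is a minimal sketch in Python using only the standard library. The values are typed in by hand from the standard published version of the data, and the names `datasets` and `summarize` are my own choices for illustration, not part of any library.

```python
from statistics import mean, variance

# The four data sets, typed in by hand. Sets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Descriptive statistics plus a simple least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)  # sample variances (divide by n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx
    return {
        "mean_x": mx, "var_x": vx,
        "mean_y": my, "var_y": vy,
        "r": cov / (vx * vy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (x, y) in datasets.items():
    s = summarize(x, y)
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} "
          f"r={s['r']:.3f} fit: y = {s['intercept']:.2f} + {s['slope']:.2f}x")
```

All four rows should come out nearly the same: a mean of x of 9, a variance of x of 11, a mean of y of about 7.5, a correlation of about 0.82, and a fitted line close to y = 3 + 0.5x.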
Do you conclude that the four data sets are the same? The statistics are pretty much identical, right? I wouldn’t be surprised if you were to publish that there is nothing to differentiate these four data sets. Now, let us draw a simple scatter plot of each of these data sets and check for ourselves:
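If you want to draw the plots yourself, the sketch below renders a crude text scatter plot for each data set using nothing beyond the Python standard library. The data is typed in by hand so the snippet runs on its own; the function `ascii_scatter` and its default axis limits (x up to 20, y up to 14) are assumptions I picked for illustration, not anything standard.

```python
# The four data sets, typed in by hand so the snippet runs on its own.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ascii_scatter(x, y, x_max=20, y_max=14, width=41, height=15):
    """Return a crude text scatter plot; every panel uses the same axes."""
    grid = [[" "] * width for _ in range(height)]
    for a, b in zip(x, y):
        col = round(a * (width - 1) / x_max)
        row = round(b * (height - 1) / y_max)
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top
    lines = ["|" + "".join(cells).rstrip() for cells in grid]
    lines.append("+" + "-" * width)  # x axis
    return "\n".join(lines)


for name, (x, y) in datasets.items():
    print(f"Data set {name}")
    print(ascii_scatter(x, y))
    print()
```

Points that land in the same character cell are drawn once, so a panel may show fewer than eleven stars. A proper plotting library would give much nicer charts from the same loop; the text version is only meant to let the shapes speak for themselves.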
Do you still think these data sets are identical? Now you know why I put so much stress on getting a feel for the data and drawing the humble scatter plot.
The data set I’ve used here is a rather famous one: it is called Anscombe’s Quartet, and it was constructed by the statistician Francis Anscombe in 1973. Its purpose is to make precisely the point of this post: one needs to get a feel for the data before diving into the analysis. Draw scatter plots for every pair of variables. Understand the relationships, and let this understanding guide your further analysis. If one were able to perfectly analyze every piece of data by diving straight into a regression, the job of analytics might as well be outsourced to computers.
PS: It is a tragedy that when they teach visualization in school, they don’t even mention the scatter plot. At a recent workshop I asked the participants to name the different kinds of graphs they knew. “Line”, “Bar” and “Pie” were the most common answers. Not one answered “scatter plot”. Given the utility of this simple plot, this is indeed tragic.