Bad Data Analysis

This is a post tangentially related to work, so I must point out that all views here are my own, and not views of my employer or anyone else I’m associated with.

The good thing about data analysis is that it’s inherently easy to do. The bad thing about data analysis is also that it’s inherently easy to do – with increasing data democratisation in companies, it is easier than ever to pull some data related to your hypothesis, build a few pivot tables and charts in Excel, and then present your results.

Why is this a bad thing, you may ask – the reason is that it is rather easy to do bad data analysis. I never tire of replying to people who ask me “what does the data say?” with “what do you want it to say? I can make it say that”. This is not a rhetorical statement. As the old saying goes, you can “take data down into the basement and torture it until it confesses to your hypothesis”.

So, for example, when I hire analysts, I don’t check as much for the ability to pull and analyse data (that can be taught) as I do for their logical thinking skills. When they do a piece of data analysis, are they able to say whether it makes sense or not? Can they identify when a correlation the data shows is spurious? Are they taking ratios along the correct axis (eg. “2% of Indians are below the poverty line”, versus “20% of the world’s poor is in India”)? Are they controlling for confounding variables?
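To make the point about ratio axes concrete, here is a minimal sketch (with made-up counts chosen to match the example above) of how the same two-way table yields very different figures depending on which axis you normalise along:

```python
import pandas as pd

# Hypothetical counts (not real figures): people below / above the poverty line,
# split by whether they live in India or elsewhere.
counts = pd.DataFrame(
    {"below_poverty_line": [28, 112], "above_poverty_line": [1372, 6488]},
    index=["India", "Rest of world"],
)

# Normalising along rows answers: "what share of Indians are below the poverty line?"
share_within_region = counts.div(counts.sum(axis=1), axis=0)

# Normalising along columns answers: "what share of the world's poor are in India?"
share_of_global_poor = counts.div(counts.sum(axis=0), axis=1)

print(share_within_region.loc["India", "below_poverty_line"])   # -> 0.02
print(share_of_global_poor.loc["India", "below_poverty_line"])  # -> 0.20
```

Both numbers come from the same table; which one is the relevant ratio depends entirely on the question being asked.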

This is the real skill in analytics – are you able to draw logical and sensible conclusions from what the data says? It is no coincidence that half my team at my current job has been formally trained in economics.

One of the externalities of being a head of analytics is that you come across a lot of bad data analysis – you are yourself responsible for some of it, your team is responsible for some more, and, given the ease of analysing data, there is a lot from everyone else as well.

And it becomes part of your job to comment on this analysis, to draw sense from it, and to say if it makes sense or not. In most cases, the analysis itself will be immaculate – well-written queries and code. The problem, almost all the time, is in the logic used.

I was reading this post by Nabeel Qureshi on puzzles. There, he quotes a book on chess puzzles, and talks about the difference between how experts and novices approach a problem.

The lesson I found the most striking is this: there’s a direct correlation between how skilled you are as a chess player, and how much time you spend falsifying your ideas. The authors find that grandmasters spend longer falsifying their idea for a move than they do coming up with the move in the first place, whereas amateur players tend to identify a solution and then play it shortly after without trying their hardest to falsify it first. (Often amateurs find reasons for playing the move — ‘hope chess’.)

Call this the ‘falsification ratio’: the ratio of time you spend trying to falsify your idea to the time you took coming up with it in the first place. For grandmasters, this is 4:1 — they’ll spend 1 minute finding the right move, and another 4 minutes trying to falsify it, whereas for amateurs this is something like 0.5:1 — 1 minute finding the move, 30 seconds making a cursory effort to falsify it.

It is the same in data analysis. If I think about the amount of time I spend analysing data, a very, very large percentage of it (I can’t put a number on it since I don’t track my time) goes into “falsifying it”. “Does this correlation make sense?” “Have I taken care of all the confounding variables?” “Does the result hold if I take a different sample or cut of the data?” “Has the data I’m using been collected properly?” “Are there any biases in the data that might be affecting the result?” And so on.
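As a minimal sketch of what one such falsification check can look like in practice (the function, thresholds and the bootstrap-resampling approach here are my own illustration, not a prescribed method), this small check asks whether a headline estimate survives being recomputed on resampled cuts of the data:

```python
import numpy as np

def result_is_stable(values, n_resamples=200, tolerance=0.1, seed=0):
    """Crude falsification check (a sketch, not a full analysis):
    does the headline estimate survive bootstrap resampling?

    `values` is the metric of interest; the tolerance and the 95%
    stability threshold below are arbitrary, illustrative choices.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    full_estimate = values.mean()
    if full_estimate == 0:
        raise ValueError("full-sample estimate is zero; a relative check is undefined")

    # Recompute the estimate on resampled "cuts" of the data.
    resampled = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_resamples)
    ])

    # Call the result stable if the resampled estimates rarely stray more
    # than `tolerance` (as a fraction) from the full-sample estimate.
    deviations = np.abs(resampled / full_estimate - 1.0)
    return (deviations < tolerance).mean() > 0.95
```

Passing a check like this does not make the analysis right, but failing it is a cheap, early signal that the conclusion is fragile.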

It is not an easy job. One small adjustment here or there, and the entire recommendation might flip. Despite being rigorous with the whole process, you can still leave in some inaccuracy. And sometimes what your data shows may not conform to the biases of the counterparty (who has much better domain knowledge), and so you have a much harder job selling it.

And once again – when someone says “we have used data, so we have been rigorous about the process”, the more likely it is that they are wrong.
