Bad Data Analysis

This is a post tangentially related to work, so I must point out that all views here are my own, and not views of my employer or anyone else I’m associated with.

The good thing about data analysis is that it’s inherently easy to do. The bad thing about data analysis is also that it’s inherently easy to do – with increasing data democratisation in companies, it is easier than ever to pull some data related to your hypothesis, build a few pivot tables and charts in Excel and then present your results.

Why is this a bad thing, you may ask – the reason is that it is rather easy to do bad data analysis. I’m never tired of telling people who ask me “what does the data say?”, “what do you want it to say? I can make it say that”. This is not a rhetorical statement. As the old saying goes, you can “take data down into the basement and torture it until it confesses to your hypothesis”.

So, for example, when I hire analysts, I don’t check as much for the ability to pull and analyse data (those can be taught) as I do for their logical thinking skills. When they do a piece of data analysis, are they able to say that it makes sense or not? Can they identify that some correlations the data shows are spurious? Are they taking ratios along the correct axis (e.g. “2% of Indians are below the poverty line”, versus “20% of the world’s poor are in India”)? Are they controlling for confounding variables?
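A quick sketch of the “correct axis” point, in Python. The counts below are made up purely for illustration (they are not real poverty statistics), and the names `population`, `poor`, `share_within` and `share_of_total` are mine. The same small table gives two very different percentages depending on which total you divide by.

```python
# Hypothetical counts (in millions), invented for illustration only.
population = {"Country A": 1000, "Rest of world": 6000}
poor = {"Country A": 20, "Rest of world": 80}


def share_within(group):
    """Percent of the group's own population that is poor."""
    return 100 * poor[group] / population[group]


def share_of_total(group):
    """Percent of all poor people who belong to the group."""
    return 100 * poor[group] / sum(poor.values())


print(f"{share_within('Country A'):.0f}% of Country A is poor")
print(f"{share_of_total('Country A'):.0f}% of the world's poor are in Country A")
```

Both numbers are “correct”; which one belongs in the analysis depends on the question being asked.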

This is the real skill in analytics – are you able to draw logical and sensible conclusions from what the data says? It is no coincidence that half my team at my current job has been formally trained in economics.

One of the externalities of being a head of analytics is that you come across a lot of bad data analysis – you are yourself responsible for some of it, your team is responsible for some more and given the ease of analysing data, there is a lot from everyone else as well.

And it becomes part of your job to comment on this analysis, to draw sense from it, and to say if it makes sense or not. In most cases, the analysis itself will be immaculate – well written queries and logic / code. The problem, almost all the time, is in the logic used.

I was reading this post by Nabeel Qureshi on puzzles. There, he quotes a book on chess puzzles, and talks about the differences between how experts approach a problem compared to novices.

The lesson I found the most striking is this: there’s a direct correlation between how skilled you are as a chess player, and how much time you spend falsifying your ideas. The authors find that grandmasters spend longer falsifying their idea for a move than they do coming up with the move in the first place, whereas amateur players tend to identify a solution and then play it shortly after without trying their hardest to falsify it first. (Often amateurs, find reasons for playing the move — ‘hope chess’.)

Call this the ‘falsification ratio’: the ratio of time you spend trying to falsify your idea to the time you took coming up with it in the first place. For grandmasters, this is 4:1 — they’ll spend 1 minute finding the right move, and another 4 minutes trying to falsify it, whereas for amateurs this is something like 0.5:1 — 1 minute finding the move, 30 seconds making a cursory effort to falsify it.

It is the same in data analysis. If I think about the amount of time I spend analysing data, a very, very large percentage of it (can’t put a number since I don’t track my time) goes into “falsifying it”: “Does this correlation make sense?”; “Have I taken care of all the confounding variables?”; “Does the result hold if I take a different sample or cut of data?”; “Has the data I’m using been collected properly?”; “Are there any biases in the data that might be affecting the result?”; and so on.
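One of these checks, “does the result hold on a different cut of data?”, can be sketched in a few lines of Python. The data below is invented so that the effect is easy to see, and the helper `pearson` and the segment names are my own. It is only one way of doing such a check, not a recipe. The correlation looks positive on the pooled data but is negative inside each segment.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: two segments, each a pair of (x values, y values).
segments = {
    "segment_1": ([1, 2, 3, 4], [5, 4, 3, 2]),
    "segment_2": ([6, 7, 8, 9], [10, 9, 8, 7]),
}

# Pool all observations, ignoring which segment they came from.
pooled_x = [x for xs, _ in segments.values() for x in xs]
pooled_y = [y for _, ys in segments.values() for y in ys]

pooled_r = pearson(pooled_x, pooled_y)
segment_r = {name: pearson(xs, ys) for name, (xs, ys) in segments.items()}

print("pooled:", round(pooled_r, 2))
for name, r in segment_r.items():
    print(name, round(r, 2))
```

If the sign of a result changes when you split by an obvious variable, the segment itself is probably a confounder, and the pooled number should not go into the presentation as is.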

It is not an easy job. One small adjustment here or there, and the entire recommendation might flip. Despite being rigorous with the whole process, you can leave in some inaccuracy. And sometimes what your data shows may not conform to the biases of the counterparty (who has much better domain knowledge) – and so you have a much harder job selling it.

And once again – when someone says “we have used data, so we have been rigorous about the process”, it is more likely that they are wrong.

Analytics and complexity

I recently learnt that a number of people think that the more variables you use in your model, the better your model is! What has surprised me is that I’ve met a lot of people who think so, and recommendations for simple models haven’t been taken too kindly.

The conversation usually goes like this:

“so what variables have you considered for your analysis of ______ ?”
“A,B,C”
“Why don’t you consider D,E,F,… X,Y,Z also? These variables matter for these reasons. You should keep all of them and build a more complete model”
“Well I considered them but they were not significant so my model didn’t pick them up”
“No but I think your model is too simplistic if it uses only three variables”

This is a conversation I’ve had with so many people that I wonder what kind of conceptions people have about analytics. Now I wonder if this is because of the difference in the way I communicate compared to other “analytics professionals”.
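To put a number on the “more variables is better” belief from that conversation, here is a small Python sketch using adjusted R-squared. The sample size and the two R-squared values are hypothetical, and adjusted R-squared is just one of several possible yardsticks, not necessarily what any given model uses. The point is that raw R-squared never goes down when you add variables to a least-squares regression, so a slightly higher R-squared from 26 variables says very little.

```python
def adjusted_r2(r2, n_obs, n_vars):
    """Adjusted R-squared: penalises R-squared for the number of variables."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)


n_obs = 50  # hypothetical sample size

# Hypothetical fits: 23 extra weak variables nudge raw R-squared up a little.
small_model = adjusted_r2(r2=0.60, n_obs=n_obs, n_vars=3)
big_model = adjusted_r2(r2=0.65, n_obs=n_obs, n_vars=26)

print(f"3 variables:  adjusted R^2 = {small_model:.2f}")
print(f"26 variables: adjusted R^2 = {big_model:.2f}")
```

With these made-up numbers the three-variable model comes out well ahead once the extra variables are paid for.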

When you do analytics, there are two ways to communicate – to simplify and to complicate (for lack of a better word). Based on my experience, what I find is that a majority of analytics professionals and modelers prefer to complicate – they talk about complicated statistical techniques they use for solving the problem (usually with fancy names) and bulldoze the counterparty into thinking they are indeed doing something hi-funda.

The other approach, followed by (in my opinion) a smaller number of people, is to simplify. You try and explain your model in simple terms that the counterparty will understand. So if your final model contains only three explanatory variables, you tell them that only three variables are used, and you show how each of these variables (and combinations thereof) contributes to the model. You draw analogies to models the counterparty can appreciate, and use that to explain.

Now, like analytics professionals can be divided into two kinds (as above), I think consumers of analytics can also be divided into two kinds. There are those that like to understand the model, and those that simply want to get into the insights. The former are better served by the complicating type analytics professionals, and the latter by the simplifying type. The other two combinations lead to disaster.

Like a good management consultant, I represent this problem using the following two-by-two:

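[Figure: analytics2by2 – two-by-two of analytics professionals (simplifying or complicating) against consumers of analytics]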

As a principle, I like to explain models in a simplified fashion, so that the consumer can completely understand it and use it in a way he sees appropriate. The more pragmatic among you, however, can take a guess at what type the consumer is and tweak your communication accordingly.

The Swarovski Earrings

On Friday evening I tweeted:

Louis philippe best white shirt – rs X1
Swarovski crystal earrings – rs X2
Dinner at taj west end – rs X3
Proposal accepted – priceless

Now I must confess that there was a lie. Which I tried to mask by using variables for the various values. Of course, at the time of tweeting this, I didn’t know the value of X3; though I figured it out an hour later. The value of X1 is well known. The lie was in the X2 bit. The thing is I don’t know. Because the Swarovski crystal earrings weren’t bought; they were won.

Back in 2000 when I entered IIT Madras, I started doing extremely badly in quizzes there. It took me a long while to get adjusted to the format there (long questions, all-night quizzes… ) and a lot of stuff that got asked there was about stuff that I didn’t care much about so I didn’t really bother doing well. There’s this old joke that every IITM quiz should start and end with a Lord of the Rings (LOTR) question with two more LOTR questions in the middle, and all this is only in one half of the quiz.

In my first year there, there was also the additional problem of finding good people to quiz with. You invariably ended up going with someone either from your hostel or your class who might have attended their school trials for the Bournvita Quiz Contest, or sometimes quizzers you knew from Bangalore. Still, the lack of a settled team meant that there was a cap on how well one could do. All through first year, I didn’t qualify in a single quiz, neither in Madras nor when I came home to Bangalore.

Second year was marginally different. There was still no settled team but the format wasn’t strange any more. And quizzes had started to get a little more general and less esoteric. I had started to qualify, or just miss qualification, in some quizzes. And around this time, while struggling with VLSI circuits and being accused by the Prof of being potential WTC Bombers (this was a few days after 9/11) I heard God and Ranga talk about some Dakshinachitra where they had qualified for the finals.

So Dakshinachitra is this heritage center on East Coast Road and they had been conducting an India Quiz. It was a strange format – three rounds of prelims with two teams (of two people) qualifying from each round. God and Ranga had gone for the first round of prelims and had sailed through. They had told me the competition hadn’t been too tough and so the following week Droopy and I headed out, taking some random local bus to the place.

We too made it peacefully to the finals and then found that it had turned out to be an all-IIT finals. However, they refused to shift the venue of the finals to the IIT campus and so all of us had to brave the Saturday afternoon Madras sun and head out again to the place. Thankfully this time they’d organized a bus from somewhere close to IIT.

I don’t remember too much of the finals apart from the fact that there was a buzzer round with extremely high stakes, in which Droopy and I did rather well. I remember one question in the buzzer round being cancelled because an audience member shouted out the answer. I remember there was this fraud-max specialist round where we were quizzed on a topic we’d picked beforehand. Thankfully the stakes there weren’t too high. It wasn’t a great quiz by general quizzing standards but what mattered was we won, marginally ahead of God and Ranga in a close finish.

The next morning Droopy and I appeared in the supplement pages of the New Indian Express, holding this huge winner’s certificate with Air India’s name on it (they took back that certificate as soon as the photo was taken). We were promised one return ticket each by Air India to any destination in some really limited list, but somehow they frauded on it and we could never fly. God and Ranga got a holiday each in some resort, and I don’t think they took that either.

There were a lot of random things as prizes. There were some random old music CDs. Maybe some movie CDs too. I remember God and Ranga getting saris (god (not God, maybe God also) knows what they did with them). Droopy and I got coupons from VLCC. I put NED to encash them. Droopy went and was given a free haircut. And then there were these earrings.

Not knowing what to do with them, I just gave them to my mother. She, however, refused to wear them saying that since I’d won them, it was only appropriate that they go to my wife. So she put them away in the locker in my Jayanagar house and told me to take them out only when I had decided who I wanted to marry. And I, then a geeky 18-year-old IITian, had decided to use these earrings while proposing marriage.

So early in the evening on Friday I went to the Jayanagar house and took the earrings out of the locker. What followed can be seen in the tweet. Oh, and now you might want to start following this blog.

PS: apologies for the extra-long post, but given the nature of the subject I suppose you can’t blame me for getting carried away.

What’s your Raashee? Astrology and Vector Length

The problem with western astrology is that there are way too few categories of people according to it. Western astrology uses a vector of length one – the part of the year in which you were born – and then concocts a story based on that. According to that, people can be classified into twelve categories (as can be seen in the great recent movie whose title is a substring of the title of this blog post) and you can tell their story based on that. Thing is that way too many people you know, and who are not like you, are in the same category as you, and this makes things so much less believable.

On the other hand, the beauty of Indian astrology is the vector length that is involved with it. There are nine planets (including the Sun and the Moon, not including the Earth, and with Rahu and Ketu instead of Uranus and Neptune) and at the time of your birth, each of them can be in one of 10 houses (not sure of the number but I think this is it). There are correlation issues so the number of possible combinations isn’t as big as you think it might be, but still there are enough possible combinations that can describe each person you know uniquely!
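A back-of-the-envelope version of this in Python, using the post’s own figures (nine planets, and my unsure guess of 10 houses). The variable names are made up for the sketch, and the count is only an upper bound, since it pretends every planet can sit in any house independently of the others.

```python
western_categories = 12  # vector of length one, twelve possible values

planets = 9   # as counted in the post
houses = 10   # the post's own (unsure) guess at the number of houses

# Upper bound: every planet independently in any house.
# The "correlation issues" mentioned above would make the real count smaller.
indian_upper_bound = houses ** planets

print(western_categories)
print(f"{indian_upper_bound:,}")
```

Even if the correlations knock off a few orders of magnitude, the longer vector leaves vastly more room than twelve buckets.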

This ability to identify almost each person uniquely is what makes Indian astrology so fascinating. Stuff is so complicated that you will never understand it. And because you will never understand it, you are more likely to believe it; unlike in western astrology where it is easy for you to see where you fit in, where things are so easy that it is easy for you to see through it.

The other thing about Indian astrology is that given the really large number of variables, it is easy for the astrologer to correct his own mistakes. He will say “Jupiter is in position 7 so X will happen” and then if X doesn’t happen he says “yeah i predicted it based on Jupiter being in 7, but then in the meantime the Sun moved into 8, and so death happened off”. It makes things so easy to cover up that it contributes to the mystique, and to the success of the art.

So a possible moral of the story is that if you want to create fraud frameworks, make sure that they involve long vectors. Make sure that you design them in such a way that the mango person won’t understand. Make sure that you build in enough variables that will allow you to cover up in case you screw up. Make sure the vectors in your framework are long enough to make the users feel special and unique, while still giving them a feel that you’ve seen someone/something similar before.

I think this is what all the successful consulting firms have done. Perfected this art of coming up with this kind of a vector. And to think that they might have been inspired by Indian astrology..