A few months back, at work, a couple of kids in my team taught me this concept called “SHAP”. I won’t go into the technical details here (or maybe I will later on in this post), but it is basically an algo that helps us explain a machine learning model.

It was one of those concepts that I found absolutely mind-blowing, to the extent that after these guys taught this concept to me, it became the proverbial hammer, and I started looking for “nails” all around the company. I’m pretty sure I’ve abused it (SHAP I mean).

Most of the documentation of SHAP is not very good, as you might expect about something that is very deeply technical. So maybe I’ll give a brief intro here. Or maybe not – it’s been a few months since I started using and abusing it, and so I’ve forgotten the maths.

In any case, this is one of those concepts that made me incredibly happy on the day I learnt about it. To put it “in brief”, what you do is zero out an explanatory variable (or replace it with its average), and see what the model predicts with the rest of the variables. The difference between this and the actual model output, approximately speaking, is the contribution of this explanatory variable to this particular prediction. (Strictly speaking, SHAP averages this difference over many combinations of the other variables, but the idea is the same.)
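Here is a minimal sketch of the simplified idea described above, not of the SHAP library itself. The toy model, the feature names and the numbers are all made up for illustration. Each feature is replaced by its average over the data, and the change in the prediction is read as that feature's contribution.

```python
from statistics import mean

# Hypothetical toy "model": any function from a feature dict to a number would do.
def predict(row):
    return 50 + 3 * row["area"] + 10 * row["bedrooms"] - 2 * row["age"]

# Hypothetical observations; the values are invented for illustration.
data = [
    {"area": 10, "bedrooms": 2, "age": 5},
    {"area": 20, "bedrooms": 3, "age": 10},
    {"area": 30, "bedrooms": 4, "age": 30},
]

def contribution(row, feature, background):
    """Actual prediction minus the prediction with one feature
    replaced by its average over the background data."""
    neutral = dict(row)
    neutral[feature] = mean(r[feature] for r in background)
    return predict(row) - predict(neutral)

row = data[0]
contribs = {feature: contribution(row, feature, data) for feature in row}
print(contribs)
```

For a linear model like this one, the contributions add up to the gap between this prediction and the prediction for an "average" row. With interactions between features that no longer holds, which is why the real method averages over combinations of features.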

The beauty of SHAP is that you can calculate the value for hundreds of explanatory variables and millions of observations in fairly quick time. And that’s what’s led me to use and abuse it.

In any case, I was reading something about American sport recently, and I realised that SHAP is almost exactly identical (in concept, though not in maths) to Wins Above Replacement.

WAR works the same way – a player is replaced by a hypothetical “average similar player” (the replacement), and the model calculates how much the team would have won in that case. A player’s WAR is thus the difference between the “actuals” (what the team has actually won) and the hypothetical if this particular player had been replaced by the average replacement.

This, if you think about it, is exactly the same as zeroing out the idiosyncrasies of a particular player. So – let’s say you had a machine learning model where you had to predict wins based on certain sets of features of each player (think of the features they put on those otherwise horrible spider charts when comparing footballers).

You build this model. And then to find out the contribution of a particular player, you get rid of all of this person’s features (or replace them with the “average” across all data points). And then you look at the prediction and how different it is from the “actual prediction”. Depending on how you look at it, it can either be SHAP or WAR.
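To make the parallel concrete, here is a toy sketch. The three players, their skill numbers, the replacement level and the wins formula are all hypothetical, and real WAR formulae are far more involved. The "WAR-style" number swaps one player for the replacement while keeping everyone else. The "Shapley-style" number (the idea SHAP is built on) averages that swap over every order in which the real players could be added to a replacement-level team.

```python
from itertools import permutations

REPLACEMENT_SKILL = 1.0                      # hypothetical replacement-level player
skills = {"A": 3.0, "B": 2.0, "C": 1.0}      # hypothetical player skills

def predicted_wins(lineup):
    # Toy model: individual skills plus an interaction between A and B.
    return lineup["A"] + lineup["B"] + lineup["C"] + lineup["A"] * lineup["B"]

def war(player):
    """Actual team minus the same team with this one player replaced."""
    lineup = dict(skills)
    lineup[player] = REPLACEMENT_SKILL
    return predicted_wins(skills) - predicted_wins(lineup)

def shapley(player):
    """Average marginal gain from swapping the player in,
    over all orders of building up the real team."""
    players = list(skills)
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        lineup = {p: REPLACEMENT_SKILL for p in players}
        for p in order:
            before = predicted_wins(lineup)
            lineup[p] = skills[p]
            if p == player:
                total += predicted_wins(lineup) - before
                break
    return total / len(orders)

for p in skills:
    print(p, war(p), shapley(p))
```

In this toy setup the two numbers differ only because of the interaction between A and B. The Shapley-style values add up exactly to the gap between the real team and an all-replacement team, while the WAR-style values double count the synergy. Enumerating all orders only works for tiny teams, so this is a sketch of the concept rather than of how any library computes it.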

In other words, the two concepts are pretty much exactly the same!

Pre-trained models

On Sunday evening, we were driving to a relative’s place in Mahalakshmi Layout when I almost missed a turn. And then I was about to miss another turn and my wife said “how bad are you with directions? You don’t even know where to turn!”.

“Well, this is your area”, I told her (she grew up in Rajajinagar). “I had very little clue of this part of town till I married you, so it’s no surprise I don’t know how to go to your cousin’s place”.

“But they moved into this house like six months ago, and every time we’ve gone there together. So if I know the route, why can’t you”, she retorted.

This gave me a trigger to go off on a rant on pre-trained models, and I’m going to inflict that on you now.

For a long time, I didn’t understand what the big deal was on pre-trained machine learning models. “If it’s trained on some other data, how will it even work with my data”, I wondered. And then recently I started using GPT4 and other similar large language models. And I started reading blogposts on how with very little finetuning these models can do “gymnastics”.

Having grown up in North Bangalore, my wife has a “pretrained model” of that part of town in her head. This means she has sufficient domain knowledge, even if she doesn’t have any specific knowledge. Now, with a small amount of new specific information (the way to her cousin’s new house, for example), it is easy for her to fit the specific information into her generic knowledge and get a clear idea on how to get there.

(PS: I’m not at all suggesting that my wife’s intelligence is artificial here)

On the other hand, my domain knowledge of North Bangalore is rather weak, despite having lived there for two years. For the longest time, Malleswaram was a Chakravyuha – I would know how to go there, but not how to get back. Given this lack of domain knowledge, the little information on the way to my wife’s cousin’s new house is not sufficient for me to find my way there.

It is similar with machines. LLMs and other pre-trained models have sufficient “generic domain knowledge” in lots of things, thanks to the large amounts of data they’ve been trained on. As a consequence, if you can train them on fairly small samples of specific data, they are able to generalise around this specific data and learn around them.
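A toy sketch of the same idea, under heavy assumptions: the "model" is just a straight line, the "specific data" is three made-up points from y = 2x + 1, and the "pretrained" model is assumed to already know the slope but not the intercept. Both models get the same five gradient steps on the same three points, and then we ask them about an input they have never seen.

```python
# Hypothetical task-specific data: three points from y = 2x + 1.
train_x = [-1.0, 0.0, 1.0]
train_y = [2 * x + 1 for x in train_x]

def fine_tune(w, b, steps=5, lr=0.1):
    """A few steps of plain gradient descent on mean squared error."""
    n = len(train_x)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(train_x, train_y)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, train_x)) / n
        grad_b = 2 * sum(errors) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def error_at(params, x):
    """Absolute error of the fitted line at an input outside the training data."""
    w, b = params
    return abs(w * x + b - (2 * x + 1))

pretrained = fine_tune(w=2.0, b=0.0)   # starts out knowing the general slope
scratch = fine_tune(w=0.0, b=0.0)      # starts out knowing nothing

print("pretrained error at x=5:", error_at(pretrained, 5.0))
print("from-scratch error at x=5:", error_at(scratch, 5.0))
```

With the same small amount of specific data and the same effort, the model that started with the right general picture ends up much closer on the unseen input. This is only an analogy for fine-tuning. Real pre-trained models differ in almost every detail.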

More pertinently, in real life, depending upon our “generic domain knowledge” of different domains, the amount of information that you and I will need to learn a certain amount about a certain domain can be very very different.

Everything is context-sensitive!

Muggoos and overfitting

Back when I was a student, there was this (rather large) species of students who we used to call “muggoos”. They were called that because they would have a habit of “mugging up the answers” – basically they would learn verbatim stuff in the textbooks and other reading material, and then just spit it out during the exams.

They were incredibly hardworking, of course – since the volume of stuff to mug was immense – and they would make up for their general lack of understanding of the concepts with their massive memories and rote learning.

On average, they did rather well – with all that mugging, the downside was floored. However, they would stumble badly in case of any “open book exams” (where we would be allowed to carry textbooks into the exams) – since the value of mugging there was severely limited. I remember having an argument once with some topper-type muggoos (with generally much better grades than me) on whether to keep exams in a particular course open book or closed book. They all wanted closed book of course.

This morning, I happened to remember this species while chatting with a friend. He was sending me some screenshots from ChatGPT and was marvelling at something which it supposedly made up (I remembered it as a popular meme from 4-5 years back). I immediately responded that ChatGPT was simply “overfitting” in this case.

Since this was a rather popular online meme, and a lot of tweets would have been part of ChatGPT’s training data, coming up with this “meme-y joke” was basically the algorithm remembering this exact pattern that occurred multiple times in the training set. There was no need to intuit or interpolate or hallucinate – the number of occurrences in the training set meant this was an “obvious joke”.

In that sense, muggoos are like badly trained pieces of artificial intelligence (well, I might argue that their intelligence IS artificial) – they haven’t learnt the concepts, so they are unable to be creative or hallucinate. However, they have been “trained” very very well on the stuff that is there in the textbooks (and other reading material) – and the moment they see part of that it’s easy for them to “complete the sentences”. So when questions in the exams come straight out of the reading materials (as they do in a LOT of Indian universities and school boards) they find it easy to answer.

However, when tested on “concepts”, they now need to intuit – and infer based on their understanding. In that sense, they are like badly trained machine learning models.

One of the biggest pitfalls in machine learning is “overfitting” – where you build a model that is so optimised to the training data that it learns quirks of the data that you don’t want it to learn. It performs superbly on the training dataset. Now, when faced with an unknown (“out of syllabus”) test set, it underperforms like crazy. In machine learning, we use techniques such as cross validation to make sure algorithms don’t overfit.
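A small sketch of the muggoo as a machine learning model, with a made-up "syllabus" in which question x has the answer 2x + 1. The Muggoo class memorises question and answer pairs verbatim, the ConceptLearner fits a straight line, and a hand-rolled k-fold cross validation (a simplified stand-in for what libraries provide) tests each one on questions it was not allowed to see.

```python
# Hypothetical "syllabus": question x has the answer 2 * x + 1.
questions = list(range(10))
answers = [2 * x + 1 for x in questions]

class Muggoo:
    """Memorises question/answer pairs verbatim."""
    def fit(self, xs, ys):
        self.memory = dict(zip(xs, ys))
        return self
    def predict(self, x):
        return self.memory.get(x, 0)   # blank answer for an unseen question

class ConceptLearner:
    """Learns the underlying rule by fitting a least-squares line."""
    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        return self
    def predict(self, x):
        return self.slope * x + self.intercept

def accuracy(model, xs, ys):
    """Share of questions answered correctly (within a small tolerance)."""
    return sum(abs(model.predict(x) - y) < 1e-6 for x, y in zip(xs, ys)) / len(xs)

def cross_val_accuracy(make_model, xs, ys, k=5):
    """Train on k-1 folds, test on the held-out fold, average the scores."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        model = make_model().fit(*zip(*train))
        scores.append(accuracy(model, *zip(*test)))
    return sum(scores) / k

print("muggoo on the textbook:", accuracy(Muggoo().fit(questions, answers), questions, answers))
print("muggoo out of syllabus:", cross_val_accuracy(Muggoo, questions, answers))
print("concept learner out of syllabus:", cross_val_accuracy(ConceptLearner, questions, answers))
```

The muggoo is perfect when the test set is a subset of the training set and hopeless otherwise, which is exactly the gap that cross validation is meant to expose.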

That, however, is not how the conventional Indian education system trains you – throughout most of the education, you find that the “test set” is a subset of the “training set” (questions in examinations come straight out of the textbook). Consequently, people with the ability to mug find that it is a winning strategy to just “overfit” and learn the textbooks verbatim – the likelihood of being caught out by unseen test data is minimal.

And then IF they get out into the real world, they find that a lot of the “test data” is unknown, and having not learnt to truly learn from the data, they struggle.

PS: Overfitting is not the only way machine learning systems misbehave. Sometimes they end up learning the entirely wrong pattern!

Mo Salah and Machine Learning

First of all, I’m damn happy that Mo Salah has renewed his Liverpool contract. With Sadio Mane also leaving, the attack was looking a bit thin (I was distinctly unhappy with the Jota-Mane-Diaz forward line we used in the Champions League final. Lacked cohesion). Nunez is still untested in terms of “leadership”, and without Salah that would’ve left Firmino as the only “attacking leader”.

(non-technical readers can skip the section in italics and still make sense of this post)

Now that this is out of the way, I’m interested in seeing one statistic (for which I’m pretty sure I don’t have the data). For each of the chances that Salah has created, I want to look at the xG (expected goals) and whether he scored or not. And then look at a density plot of xG for both categories (scored or not). 

For most players, this is likely to result in two very distinct curves – they are likely to score from a large % of high xG chances, and almost not score at all from low xG chances. For Salah, though, the two density curves are likely to be a lot closer.
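I don't have the data, so here is a sketch with entirely made-up shots, just to show the shape of the comparison. A proper density plot needs a plotting library. As a crude stand-in, this compares the average xG of the chances that were scored with the average xG of the chances that were missed.

```python
from statistics import mean

# Hypothetical (xG, scored) pairs; these are NOT real shot data.
typical_striker = [(0.8, True), (0.7, True), (0.6, True), (0.5, True),
                   (0.1, False), (0.2, False), (0.05, False), (0.3, False)]
streaky_striker = [(0.8, False), (0.7, False), (0.2, False), (0.3, False),
                   (0.1, True), (0.6, True), (0.05, True), (0.5, True)]

def xg_gap(shots):
    """Average xG of scored chances minus average xG of missed chances.
    A large positive gap means the two xG distributions sit far apart."""
    scored = [xg for xg, goal in shots if goal]
    missed = [xg for xg, goal in shots if not goal]
    return mean(scored) - mean(missed)

print("typical striker gap:", xg_gap(typical_striker))
print("streaky striker gap:", xg_gap(streaky_striker))
```

For the made-up "typical" striker the scored chances have a much higher xG than the missed ones. For the made-up "streaky" one the two groups overlap, which is the pattern I'd expect to see for Salah if the data were available.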

What I’m saying is – most strikers score well from easy chances, and fail to score from difficult chances. Salah is not like that. On the one hand, he creates and scores some extraordinary goals out of nothing (low xG). On the other, he tends to miss a lot of seemingly easy chances (high xG).

In fact, it is quite possible to look at a player like Salah, see a few sitters that he has missed (he misses quite a few of them), and think he is a poor forward. And if you look at a small sample of data (or short periods of time) you are likely to come to the same conclusion. Look at the last 3-4 months of the 2021-22 season. The consensus among pundits then was that Salah had become poor (and on Reddit, you could see Liverpool fans arguing that we shouldn’t give him a lucrative contract extension since ‘he has lost it’).

It is well possible that this is exactly the conclusion Jose Mourinho came to back in 2013-14 when he managed Salah at Chelsea (and gave him very few opportunities). The thing with a player like Salah is that he is so unpredictable that it is very possible to see samples and think he is useless.

Of late, I’ve been doing (rather, supervising, and there is no pun intended) a lot of machine learning work. A lot of this has to do with binary classification – classifying something as either a 0 or a 1. Data scientists build models, which give out a probability score that the thing is a 1, and then use some (sometimes arbitrary) cutoff to determine whether the thing is a 0 or a 1.

There are a bunch of metrics in data science on how good a model is, and it all comes down to what the model predicted and what “really” happened. And I’ve seen data scientists work super hard to improve on these accuracy measures. What can be done to predict a little bit better? Why is this model only giving me 77% ROC-AUC when for the other problem I was able to get 90%?
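For readers who haven't met these metrics, here is a small sketch with made-up labels and scores. It shows the cutoff step, plain accuracy, and ROC-AUC computed straight from its definition: the probability that a randomly chosen 1 gets a higher score than a randomly chosen 0.

```python
def classify(scores, cutoff=0.5):
    """Turn probability scores into 0/1 predictions using a cutoff."""
    return [1 if s >= cutoff else 0 for s in scores]

def accuracy(labels, predictions):
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive is scored
    higher; ties count as half. Does not depend on any cutoff."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: what "really" happened, and what the model scored.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.3, 0.2, 0.1]

print("accuracy at cutoff 0.5: ", accuracy(labels, classify(scores, 0.5)))
print("accuracy at cutoff 0.35:", accuracy(labels, classify(scores, 0.35)))
print("ROC-AUC:", roc_auc(labels, scores))
```

Accuracy moves with the (sometimes arbitrary) cutoff, while ROC-AUC only cares about how well the scores rank the 1s above the 0s, which is one reason it gets quoted so often.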

The thing is – if the variable you are trying to predict is something like whether Salah will score from a particular chance, your accuracy metric will be really low indeed. Because he is fundamentally unpredictable. It is the same with some of the machine learning stuff – a lot of models are trying to predict something that is fundamentally unpredictable, so there is a limit on how accurate the model will get.

The problem is that you would have come across several problem statements that are so much more predictable that you think it is a problem with you (or your model) that you can’t predict better. Pundits (or Jose) would have seen so many strikers who predictably score from good chances that they think Salah is not good.

The solution in these cases is to look at aggregates. Looking for each single prediction will not take us anywhere. Instead, can we predict over a large set of data whether we broadly got it right? In my “research” for this blogpost, I found this.

Last season, on average, Salah scored precisely as many goals as the model would’ve predicted! You might remember stunners like the one against Manchester City at Anfield. So you know where things got averaged out.
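The point about aggregates can be sketched with invented numbers (these are not Salah's actual shots). Even if the xG values are assumed to be the true scoring probabilities, calling each shot individually gets a fair number wrong. The total expected goals, though, can line up with the goals actually scored.

```python
import math

# Hypothetical shots as (xG, scored) pairs; invented for illustration.
shots = (
    [(0.1, g) for g in [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
    + [(0.5, g) for g in [1, 0, 1, 0]]
    + [(0.8, g) for g in [1, 1, 0, 1, 1]]
)

# Shot by shot: predict "goal" whenever xG is at least 0.5.
correct = sum((xg >= 0.5) == bool(goal) for xg, goal in shots)
per_shot_accuracy = correct / len(shots)

# In aggregate: total expected goals against goals actually scored.
expected_goals = math.fsum(xg for xg, _ in shots)
actual_goals = sum(goal for _, goal in shots)

print("per-shot accuracy:", round(per_shot_accuracy, 3))
print("expected goals:", expected_goals, "actual goals:", actual_goals)
```

In this made-up sample roughly one call in five is wrong at the level of individual shots, and no cleverer model could do much better when the outcome is this noisy, yet the aggregate is spot on.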

Modelling for accuracy

Recently I’ve been remembering the first assignment of my “quantitative methods 2” course at IIMB back in 2004. In the first part of that course, we were learning regression. And so this assignment involved a regression problem. Not too hard at first sight – maybe 3 explanatory variables.

We had been randomly divided into teams of four. I remember working on it in the Computer Centre, in close proximity to some other teams. I remember trying to “do gymnastics” – combining variables, transforming them, all in the hope of trying to get the “best possible R square”. From what I remember, most of the groups went “R square hunting” that day. The assignment had been cleverly chosen such that for an academic exercise, the R Square wasn’t very high.

As an aside – one thing a lot of people take a long time to come to terms with is that in “real life” (industry problems) R squares aren’t usually that high. Forecast accuracy isn’t that high. And that the elegant methods they had learnt back in school / academia may not be as elegant any more in industry. I think I’ve written about this, but I can’t find the link now.

Anyway, back to QM2. I remember the professor telling us that three groups would be chosen at random on the day of the assignment submission, and from each of these three groups one person would be chosen at random who would have to present the group’s solution to the class. I remember that the other three people in my group all decided to bunk class that day! In any case, our group wasn’t called to present.

The whole point of this massive build up is – our approach (and the approach of most other groups) had been all wrong. We had just gone in a mad hunt for R square, not bothering to figure out whether the wild transformations and combinations that we were making made any business sense. Moreover, in our mad hunt for R square, we had all forgotten to consider whether a particular variable was significant, and if the regression itself was significant.
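As a reminder of what we had skipped, here is a sketch of a one-variable regression done by hand, reporting the t-statistic of the slope alongside R square. The data are made up, and the second explanatory variable is deliberately junk. As a rough rule of thumb, a slope whose t-statistic is well below 2 in absolute value is hard to tell apart from zero.

```python
import math

def simple_ols(xs, ys):
    """Least-squares line with R square and the slope's t-statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    std_err = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - sse / sst,
        "t_stat": slope / std_err,
    }

# Hypothetical data: one sensible explanatory variable and one junk one.
sales = [2.1, 2.9, 3.2, 4.8, 4.1, 6.3, 5.9, 7.2]
ad_spend = [1, 2, 3, 4, 5, 6, 7, 8]
junk = [3, 1, 4, 1, 5, 9, 2, 6]

sensible_fit = simple_ols(ad_spend, sales)
junk_fit = simple_ols(junk, sales)
print("ad spend:", sensible_fit)
print("junk:    ", junk_fit)
```

With several variables the danger is bigger, since R square never goes down when you add a variable, junk or not. Checking significance, and whether the variable makes business sense, is what stops the hunt from going wild.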

What we learnt was that while R square matters, it is not everything. The “model needs to be good”. The variables need to make sense. In statistics you can’t just go about optimising for one metric – there are several others. This lesson has stuck with me, and it guides how I approach all kinds of data modelling work. And I realise that this is in conflict with the way data science is widely practiced nowadays.

The way data science is largely practiced in the wild nowadays is precisely a mad hunt for R Square (or area under ROC curve, if you’re doing a classification problem). Whether the variables used make sense doesn’t matter. Whether the transformations are sound doesn’t matter. It doesn’t matter at all whether the model is “good”, or appropriate – the only measure of goodness of the model seems to be the R square!

In a way, contests such as Kaggle have exacerbated this trend. In contests, typically, there is a precise metric (such as R Square) that you are supposed to maximise. With contests being evaluated algorithmically, it is difficult to evaluate on multiple parameters – especially not whether “the model is good”. And since nowadays a lot of data scientists hone their skills by participating in contests such as on Kaggle, they are tuned to simply go R square hunting.

Also, the big difference between Kaggle and real life is that in Kaggle, the model that you build doesn’t matter. It’s just a combination. You get the best R square. You win. You take the prize. You go home.

You don’t need to worry about how the data for the model was collected. The model doesn’t have to be implemented. No business decisions need to be made based on the model. Contest done, model done.

Obviously that is not how things work in real life. Building the model is only one in a long series of steps in solving the business problem. And when you focus too much on just one thing – the model’s accuracy in the data that you have been given, a lot can be lost in the rest of the chain (including application of the model in future situations).

And in this way, by focussing on just a small portion of the entire data science process (model building), I think Kaggle (and other similar competition platforms) has actually done a massive disservice to data science itself.


This is completely unrelated to the rest of the post, but too small to merit a post of its own.

Suppose you ask a software engineer to sort a few datasets. He goes about applying bubble sort, heap sort, quick sort, insertion sort and a whole host of other techniques. And then picks the one that sorted the given datasets fastest.

That’s precisely how it seems “data science” is practiced nowadays.

Junior Data Scientists

Since this is a work related post, I need to emphasise that all opinions in this are my own, and don’t reflect those of any organisation / organisations I might be affiliated with.

The last-released episode of my Data Chatter podcast is with Abdul Majed Raja, a data scientist at Atlassian. We mostly spoke about R and Python, the two programming languages / packages most used for data science, and spoke about their relative merits and demerits.

While we mostly spoke about R and Python, Abdul’s most insightful comment, in my opinion, had to do with neither. While talking about online tutorials and training, he spoke about how most tutorials related to data science are aimed at the entry level, for people wanting to become data scientists, and that there was very little readymade material to help people become better data scientists.

And from my vantage point, as someone who has been heavily trying to recruit data scientists through the course of this year, this is spot on. A lot of profiles I get (most candidates who apply to my team get put through an open ended assignment) seem uncorrelated with the stated years of experience on their CVs. Essentially, a lot of them just appear “very junior”.

This “juniority”, in most cases, comes through in the way that people have done their assignments. A telltale sign, for example, is an excessive focus on necessary but nowhere sufficient things such as data cleaning, variable transformation, etc. Another telltale sign is the simple application of methods without bothering to explain why the method was chosen in the first place.

Apart from the lack of tutorials around, one reason why the quality of data science profiles continues to remain “junior” could be the organisation of teams themselves. To become better at your job, you need to interact with people who are better than you at your job. Unfortunately, the rapid rise in demand for data scientists in the last decade has meant that this peer learning is not always there.

Yes – if you are a bunch of data scientists working together, you can pull each other up. However, if many of you have come in through the same process, it is that much more difficult – there is no benchmark for you.

The other thing is the structure of the teams (I’m saying this with very little data, so call me out if I’m bullshitting) – unlike software engineers, data scientists seldom work in large teams. Sometimes they are scattered across the organisation, largely working with tech or business teams. In any case, companies don’t need that many data scientists. So the number is low to start off with as well.

Another reason is the structure of the market – for the last decade the demand for data scientists has far exceeded the available supply. So that has meant that there is no real reason to upskill – you’ll get a job anyway.

Abdul’s solution, in the absence of tutorials, is for data scientists to look at other people’s code. The R community, for example, has a weekly Tidy Tuesday data challenge, and a lot of people who take that challenge put up their code online. I’m pretty certain similar resources exist for Python (on Kaggle, if not anywhere else).

So for someone who wants to see how other data scientists work and learn from them, there are plenty of resources around.

PS: I want to record a podcast episode on the “pile stirring” epidemic in machine learning (where people simply throw methods at a dataset without really understanding why that should work, or understanding the basic math of different methods). So far I’ve been unable to find a suitable guest. Recommendations welcome.

The Science in Data Science

The science in “data science” basically represents the “scientific method”.

It’s a decade since the phrase “data scientist” got coined, though if you go on LinkedIn, you will find people who claim to have more than two decades of experience in the subject.

The origins of the phrase itself are unclear, though some sources claim that it came out of this HBR article in 2012 written by Thomas Davenport and DJ Patil (though, in 2009, Hal Varian, formerly Google’s Chief Economist, had said that the “sexiest job of the 21st century” will be that of a statistician).

Some of you might recall that in 2018, I had said that “I’m not a data scientist any more”. That was mostly down to my experience working with companies in London, where I found that data science was used as a euphemism for “machine learning” – something I was incredibly uncomfortable with.

With the benefit of hindsight, it seems like I was wrong. My view on data science being a euphemism for machine learning came from interacting with small samples of people (though it could be an English quirk). As I’ve dug around over the years, it seems like the “science” in data science comes not from the maths in machine learning, but elsewhere.

One phenomenon that had always intrigued me was the number of people with PhDs, especially NOT in maths, computer science or statistics, who have made a career in data science. Initially I put it down to “the gap between PhD and tenure track faculty positions in science”. However, the numbers kept growing.

The more perceptive of you might know that I run a podcast now. It is called “Data Chatter”, and is ten episodes old now. The basic aim of the podcast is for me to have some interesting conversations – and then release them for public benefit. Yeah, yeah.

So, there was this thing that intrigued me, and I have a podcast. I did what you would have expected me to do – get on a guest who went from a science background to data science. I got Dhanya, my classmate from school, to talk about how her background with a PhD in neuroscience has helped her become a better data scientist.

It is a fascinating conversation, and served its primary purpose of making me understand what the “science” in data science really is. I had gone into the conversation expecting to talk about some machine learning, and how that gets used in academia or whatever. Instead, we spoke for an hour about designing experiments, collecting data and testing hypotheses.

The science in “data science” basically represents the “scientific method”. What Dhanya told me (you should listen to the conversation) is that a PhD prepares you for thinking in the scientific method, and drills into you years of practice in it. And this is especially true of “experimental” PhDs.

And then, last night, while preparing the notes for the podcast release, I stumbled upon the original HBR article by Thomas Davenport and DJ Patil talking about “data science”. And I found that they talk about the scientific method as well. And I found that I had talked about it in my newsletter as well – only to forget it later. This is what I had written:

Reading Patil and Davenport’s article carefully suggests, however, that companies might be making a deliberate attempt at recruiting pure science PhDs for data scientist roles.

The following excerpts from the article (which possibly shaped the way many organisations think about data science) can help us understand why PhDs are sought after as data scientists.

  • Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time (Ed: the article was published in late 2012, so we’re almost “five years later” now)
  • Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results.
  • Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology.
  • It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path

Patil and Davenport make it very clear that traditional “data analysts” may not make for great data scientists.

We learn, and we forget, and we re-learn. But learning is precisely what the scientific method, which underpins the “science” in data science, is all about. And it is definitely NOT about machine learning.

Statistical analysis revisited – machine learning edition

Over ten years ago, I wrote this blog post that I had termed as a “lazy post” – it was an email that I’d written to a mailing list, which I’d then copied onto the blog. It was triggered by someone on the group making an off-hand comment of “doing regression analysis”, and I had set off on a rant about why the misuse of statistics was a massive problem.

Ten years on, I find the post to be quite relevant, except that instead of “statistics”, you just need to say “machine learning” or “data science”. So this is a truly lazy post, where I piggyback on my old post, to talk about the problems with indiscriminate use of data and models.

I had written:

there is this popular view that if there is data, then one ought to do statistical analysis, and draw conclusions from that, and make decisions based on these conclusions. unfortunately, in a large number of cases, the analysis ends up being done by someone who is not very proficient with statistics and who is basically applying formulae rather than using a concept. as long as you are using statistics as concepts, and not as formulae, I think you are fine. but you get into the “ok i see a time series here. let me put regression. never mind the significance levels or stationarity or any other such blah blah but i’ll take decisions based on my regression” then you are likely to get into trouble.

The modern version of this is – everybody wants to do “big data” and “data science”. So if there is some data out there, people will want to draw insights from it. And since it is easy to apply machine learning models (thanks to open source toolkits such as the scikit-learn package in Python), people who don’t understand the models indiscriminately apply them to the data that they have got. So you have people who don’t really understand data or machine learning working with those, and creating models that are dangerous.

As long as people have an idea of the models they are using, the assumptions behind them, and the quality of data that goes into the models, we are fine. However, we are increasingly seeing cases of people using improper or biased data and applying models they don’t understand on top of it, with an impact that affects the wider world.

So the problem is not with “artificial intelligence” or “machine learning” or “big data” or “data science” or “statistics”. It is with the people who use them incorrectly.


Segmentation and machine learning

For best results, use machine learning to do customer segmentation, but then get humans with domain knowledge to validate the segments

There are two common ways in which people do customer segmentation. The “traditional” method is to manually define the axes through which the customers will get segmented, and then simply look through the data to find the characteristics and size of each segment.

Then there is the “data science” way of doing it, which is to ignore all intuition, and simply use some method such as K-means clustering and “do gymnastics” with the data and find the clusters.

A quantitative extreme of this method is to do gymnastics with your data, get segments out of it, and quantitatively “take action” on it without really bothering to figure out what each cluster represents. Loosely speaking, this is how a lot of recommendation systems nowadays work – some algorithm somewhere finds people similar to you based on your behaviour, and recommends to you what they liked.

I usually prefer a sort of middle ground. I like to let the algorithms (k-means easily being my favourite) come up with the segments based on the data, and then have a bunch of humans look at the segments and make sense of them.

Basically whatever segments are thrown up by the algorithm need to be validated by human intuition. Getting counterintuitive clusters is also not a problem – on several occasions, the people I’ve asked to validate the clusters (usually clients) have used the counterintuitive clusters to discover bugs, gaps in the data or patterns that they didn’t know of earlier.

Also, in terms of validation of clusters, it is always useful to get people with domain knowledge to validate the clusters. And this also means that whatever clusters you’ve generated you are able to represent them in a human-readable format. The best way of doing that is to use the cluster centres and then represent them somehow in a “physical” manner.
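A sketch of what this middle ground can look like, using a tiny hand-written k-means on invented customers (orders per month, average basket value). The feature names, the data and the starting centres are all assumptions for illustration. In practice you would scale the features first and probably use a library implementation.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centres, iterations=10):
    """Plain k-means: assign each point to its nearest centre, move each
    centre to the mean of its points, repeat until nothing changes."""
    for _ in range(iterations):
        labels = [min(range(len(centres)), key=lambda c: sq_dist(p, centres[c]))
                  for p in points]
        new_centres = []
        for c in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centres.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centres.append(centres[c])   # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres

# Hypothetical customers: (orders per month, average basket value).
customers = [(1, 100), (2, 120), (1, 90), (10, 20), (12, 25), (11, 30)]
labels, centres = kmeans(customers, centres=[customers[0], customers[-1]])

def describe(index, centre):
    """Turn a cluster centre into something a domain expert can react to."""
    orders, basket = centre
    return f"Segment {index}: about {orders:.1f} orders a month, average basket {basket:.1f}"

for i, centre in enumerate(centres):
    print(describe(i, centre))
```

The printed descriptions are what goes in front of the humans with domain knowledge. Whether "rare big spenders" and "frequent small buyers" are real segments or a quirk of the data is their call, not the algorithm's.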

I started writing this post some three days ago and am only getting to finish it now. Unfortunately, in the meantime I’ve forgotten the exact motivation of why I started writing this. If I recall that, I’ll maybe do another post.

Taking Intelligence For Granted

There was a point in time when the use of artificial intelligence or machine learning or any other kind of intelligence in a product was a source of competitive advantage and differentiation. Nowadays, however, many people have got so spoiled by the use of intelligence in many products they use that it has become more of a hygiene factor.

Take this morning’s post, for example. One way to look at it is that Spotify with its customisation algorithms and recommendations has spoiled me so much that I find Amazon’s pushing of Indian music irritating (Amazon’s approach can be called “naive customisation”, where they push Indian music to me only because I’m based in India, and don’t learn further based on my preferences).

Had I not been exposed to the more intelligent customisation that Spotify offers, I might have found Amazon’s naive customisation interesting. However, Spotify’s degree of customisation has spoilt me so much that Amazon is simply inadequate.

This expectation of intelligence goes beyond product and service classes. When we get used to Spotify recommending music we like based on our preferences, we hold Netflix’s recommendation algorithm to a higher standard. We question why the Flipkart homepage is not customised to us based on our previous shopping. Or why Google Maps doesn’t learn that some of us don’t like driving through small roads when we can help it.

That customers take intelligence for granted nowadays means that businesses have to invest more in offering this intelligence. Easy-to-use data analysis and machine learning packages mean that at least some part of an industry uses intelligence in at least some form (even if they might do it badly in case they fail to throw human intelligence into the mix!).

So if you are in the business of selling to end customers, keep in mind that they are used to seeing intelligence everywhere around them, and whether they state it or not, they expect it from you.