Modelling for accuracy

Recently I’ve been remembering the first assignment of my “quantitative methods 2” course at IIMB back in 2004. In the first part of that course, we were learning regression. And so this assignment involved a regression problem. Not too hard at first sight – maybe 3 explanatory variables.

We had been randomly divided into teams of four. I remember working on it in the Computer Centre, in close proximity to some other teams. I remember trying to “do gymnastics” – combining variables, transforming them, all in the hope of getting the “best possible R square”. From what I remember, most of the groups went “R square hunting” that day. The assignment had been cleverly chosen such that, for an academic exercise, the R square wasn’t very high.

As an aside – one thing a lot of people take a long time to come to terms with is that in “real life” (industry problems) R squares aren’t usually that high. Forecast accuracy isn’t that high. And that the elegant methods they had learnt back in school / academia may not be as elegant any more in industry. I think I’ve written about this, but I can’t find the link now.

Anyway, back to QM2. I remember the professor telling us that three groups would be chosen at random on the day of the assignment submission, and from each of these three groups one person would be chosen at random who would have to present the group’s solution to the class. I remember that the other three people in my group all decided to bunk class that day! In any case, our group wasn’t called to present.

The whole point of this massive build up is – our approach (and the approach of most other groups) had been all wrong. We had just gone on a mad hunt for R square, not bothering to figure out whether the wild transformations and combinations that we were making made any business sense. Moreover, in our mad hunt for R square, we had all forgotten to consider whether each variable was significant, and whether the regression itself was significant.

What we learnt was that while R square matters, it is not everything. The “model needs to be good”. The variables need to make sense. In statistics you can’t just go about optimising for one metric – there are several others that matter. And this lesson has stuck with me. It guides how I approach all kinds of data modelling work. And I realise that is in conflict with the way data science is widely practiced nowadays.
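
To make that lesson concrete, here is a minimal sketch of the kind of checks we should have been doing beyond the R square – Python with statsmodels, on made-up data with hypothetical variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up stand-in for the assignment's dataset: a target and three
# explanatory variables (the names are purely illustrative)
rng = np.random.default_rng(2004)
df = pd.DataFrame({
    "price": rng.normal(100, 10, 200),
    "promo_spend": rng.normal(50, 5, 200),
    "footfall": rng.normal(1000, 100, 200),
})
df["sales"] = 5 * df["footfall"] - 20 * df["price"] + rng.normal(0, 500, 200)

X = sm.add_constant(df[["price", "promo_spend", "footfall"]])
model = sm.OLS(df["sales"], X).fit()

# The R square is only one number; the p-value of each coefficient and
# the F-statistic of the overall regression matter just as much
print(model.rsquared)    # goodness of fit
print(model.pvalues)     # is each variable significant? (promo_spend shouldn't be)
print(model.f_pvalue)    # is the regression itself significant?
```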

The way data science is largely practiced in the wild nowadays is precisely such a mad hunt for R square (or area under the ROC curve, if you’re doing a classification problem). Whether the variables used make sense doesn’t matter. Whether the transformations are sound doesn’t matter. It doesn’t matter at all whether the model is “good”, or appropriate – the only measure of goodness of the model seems to be the R square!

In a way, contests such as Kaggle have exacerbated this trend. In contests, typically, there is a precise metric (such as R square) that you are supposed to maximise. With contests being evaluated algorithmically, it is difficult to judge entries on multiple parameters – especially on something as fuzzy as whether “the model is good”. And since nowadays a lot of data scientists hone their skills by participating in contests on platforms such as Kaggle, they are tuned to simply go R square hunting.
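
As a caricature, the contest workflow looks something like the sketch below – scikit-learn on synthetic data, where the only question asked of any model is its cross-validated AUC, and nobody asks whether the features or the model make any business sense:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a contest dataset
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# The "mad hunt": whichever model scores highest on the metric wins
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
print(max(scores, key=scores.get), scores)
```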

Also, the big difference between Kaggle and real life is that in Kaggle, the model that you build doesn’t matter. It’s just whatever combination gets you the best R square. You win. You take the prize. You go home.

You don’t need to worry about how the data for the model was collected. The model doesn’t have to be implemented. No business decisions need to be made based on the model. Contest done, model done.

Obviously that is not how things work in real life. Building the model is only one in a long series of steps in solving the business problem. And when you focus too much on just one thing – the model’s accuracy on the data that you have been given – a lot can be lost in the rest of the chain (including application of the model in future situations).

And in this way, by focussing on just a small portion of the entire data science process (model building), I think Kaggle (and other similar competition platforms) has actually done a massive disservice to data science itself.

Tailpiece

This is completely unrelated to the rest of the post, but too small to merit a post of its own.

Suppose you ask a software engineer to sort a few datasets. He goes about applying bubble sort, heap sort, quick sort, insertion sort and a whole host of other techniques. And then picks the one that sorted the given datasets fastest.
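
Spelled out in Python (with the usual textbook implementations of the sorts, purely for illustration), that procedure would look something like this:

```python
import random
import time

def bubble_sort(xs):
    xs = xs[:]
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def insertion_sort(xs):
    xs = xs[:]
    for i in range(1, len(xs)):
        key = xs[i]
        j = i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key
    return xs

data = [random.random() for _ in range(2000)]

# Apply everything, then pick whatever ran fastest on this one dataset
candidates = {"bubble": bubble_sort, "insertion": insertion_sort, "timsort": sorted}
timings = {}
for name, fn in candidates.items():
    start = time.perf_counter()
    fn(data)
    timings[name] = time.perf_counter() - start

print(min(timings, key=timings.get))
```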

That’s precisely how it seems “data science” is practiced nowadays.

Junior Data Scientists

Since this is a work-related post, I need to emphasise that all opinions in this are my own, and don’t reflect those of any organisation(s) I might be affiliated with.

The most recently released episode of my Data Chatter podcast is with Abdul Majed Raja, a data scientist at Atlassian. We mostly spoke about R and Python, the two programming languages (and package ecosystems) most used for data science, and discussed their relative merits and demerits.

While we mostly spoke about R and Python, Abdul’s most insightful comment, in my opinion, had to do with neither. While talking about online tutorials and training, he spoke about how most tutorials related to data science are aimed at the entry level, for people wanting to become data scientists, and that there was very little readymade material to help people become better data scientists.

And from my vantage point, as someone who has been heavily trying to recruit data scientists through the course of this year, this is spot on. The quality of a lot of the profiles I get (most candidates who apply to my team are put through an open-ended assignment) seems uncorrelated with the stated years of experience on their CVs. Essentially, a lot of them just appear “very junior”.

This “juniority”, in most cases, comes through in the way that people have done their assignments. A telltale sign, for example, is an excessive focus on necessary but nowhere near sufficient things such as data cleaning, variable transformation, etc. Another telltale sign is the simple application of methods without bothering to explain why the method was chosen in the first place.

Apart from the lack of tutorials, one reason why the quality of data science profiles continues to remain “junior” could be the organisation of teams themselves. To become better at your job, you need to interact with people who are better than you at your job. Unfortunately, the rapid rise in demand for data scientists in the last decade has meant that this peer learning is not always there.

Yes – if you are a bunch of data scientists working together, you can pull each other up. However, if many of you have come in through the same process, it is that much more difficult – there is no benchmark for you.

The other thing is the structure of the teams (I’m saying this with very little data, so call me out if I’m bullshitting) – unlike software engineers, data scientists seldom work in large teams. Sometimes they are scattered across the organisation, largely working with tech or business teams. In any case, companies don’t need that many data scientists. So the number is low to start off with as well.

Another reason is the structure of the market – for the last decade the demand for data scientists has far exceeded the available supply. So that has meant that there is no real reason to upskill – you’ll get a job anyway.

Abdul’s solution, in the absence of tutorials, is for data scientists to look at other people’s code. The R community, for example, has a weekly Tidy Tuesday data challenge, and a lot of people who take that challenge put up their code online. I’m pretty certain similar resources exist for Python (on Kaggle, if not anywhere else).

So for someone who wants to see how other data scientists work and learn from them, there are plenty of resources around.

PS: I want to record a podcast episode on the “pile stirring” epidemic in machine learning (where people simply throw methods at a dataset without really understanding why that should work, or understanding the basic math of different methods). So far I’ve been unable to find a suitable guest. Recommendations welcome.

The Science in Data Science

The science in “data science” basically represents the “scientific method”.

It’s a decade since the phrase “data scientist” was coined, though if you go on LinkedIn, you will find people who claim to have more than two decades of experience in the subject.

The origins of the phrase itself are unclear, though some sources claim that it came out of this HBR article in 2012 written by Thomas Davenport and DJ Patil (though, in 2009, Hal Varian, then Google’s Chief Economist, had said that the “sexy job” of the coming decade would be that of the statistician).

Some of you might recall that in 2018, I had said that “I’m not a data scientist any more”. That was mostly down to my experience working with companies in London, where I found that data science was used as a euphemism for “machine learning” – something I was incredibly uncomfortable with.

With the benefit of hindsight, it seems like I was wrong. My view of data science being a euphemism for machine learning came from interacting with small samples of people (though it could also have been a quirk of the English companies I was working with). As I’ve dug around over the years, it seems like the “science” in data science comes not from the maths in machine learning, but from elsewhere.

One phenomenon that had always intrigued me was the number of people with PhDs, especially NOT in maths, computer science or statistics, who have made a career in data science. Initially I put it down to “the gap between PhD and tenure track faculty positions in science”. However, the numbers kept growing.

The more perceptive of you might know that I run a podcast now. It is called “Data Chatter”, and is ten episodes old. The basic aim of the podcast is for me to have some interesting conversations – and then release them for public benefit. Yeah, yeah.

So, there was this thing that intrigued me, and I have a podcast. I did what you would have expected me to do – get on a guest who went from a science background to data science. I got Dhanya, my classmate from school, to talk about how her background with a PhD in neuroscience has helped her become a better data scientist.

It is a fascinating conversation, and served its primary purpose of making me understand what the “science” in data science really is. I had gone into the conversation expecting to talk about some machine learning, and how that gets used in academia or whatever. Instead, we spoke for an hour about designing experiments, collecting data and testing hypotheses.

The science in “data science” basically represents the “scientific method”. What Dhanya told me (you should listen to the conversation) is that a PhD prepares you to think in the scientific method, and drills years of practice of it into you. And this is especially true of “experimental” PhDs.

And then, last night, while preparing the notes for the podcast release, I stumbled upon the original HBR article by Thomas Davenport and DJ Patil talking about “data science”. And I found that they talk about the scientific method as well. And I found that I had talked about it in my newsletter as well – only to forget it later. This is what I had written:

Reading Patil and Davenport’s article carefully suggests, however, that companies might be making a deliberate attempt at recruiting pure science PhDs for data scientist roles.

The following excerpts from the article (which possibly shaped the way many organisations think about data science) can help us understand why PhDs are sought after as data scientists.

  • Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time (Ed: the article was published in late 2012, so we’re almost “five years later” now)
  • Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results.
  • Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology.
  • It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path

Patil and Davenport make it very clear that traditional “data analysts” may not make for great data scientists.

We learn, and we forget, and we re-learn. But learning is precisely what the scientific method, which underpins the “science” in data science, is all about. And it is definitely NOT about machine learning.

Statistical analysis revisited – machine learning edition

Over ten years ago, I wrote this blog post that I had termed a “lazy post” – it was an email that I’d written to a mailing list, which I’d then copied onto the blog. It was triggered by someone on the group making an off-hand comment about “doing regression analysis”, and I had set off on a rant about why the misuse of statistics was a massive problem.

Ten years on, I find the post to be quite relevant, except that instead of “statistics”, you just need to say “machine learning” or “data science”. So this is a truly lazy post, where I piggyback on my old post, to talk about the problems with indiscriminate use of data and models.

I had written:

there is this popular view that if there is data, then one ought to do statistical analysis, and draw conclusions from that, and make decisions based on these conclusions. unfortunately, in a large number of cases, the analysis ends up being done by someone who is not very proficient with statistics and who is basically applying formulae rather than using a concept. as long as you are using statistics as concepts, and not as formulae, I think you are fine. but you get into the “ok i see a time series here. let me put regression. never mind the significance levels or stationarity or any other such blah blah but i’ll take decisions based on my regression” then you are likely to get into trouble.

The modern version of this is – everybody wants to do “big data” and “data science”. So if there is some data out there, people will want to draw insights from it. And since it is easy to apply machine learning models (thanks to open source toolkits such as the scikit-learn package in Python), people who don’t understand the models indiscriminately apply them to whatever data they have got. So you have people who don’t really understand data or machine learning working with both, and creating models that are dangerous.

As long as people have an idea of the models they are using, the assumptions behind them, and the quality of data that goes into the models, we are fine. However, we are increasingly seeing cases of people taking improper or biased data, applying models they don’t understand on top of it, and producing results that affect the wider world.
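
For instance, the “never mind the significance levels or stationarity” bit from the quote above corresponds to checks that take only a couple of lines – a rough sketch, using statsmodels on placeholder data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# Placeholder series: two independent random walks standing in for
# "a time series I want to regress on another"
rng = np.random.default_rng(42)
x = pd.Series(np.cumsum(rng.normal(size=200)))
y = pd.Series(np.cumsum(rng.normal(size=200)))

# Check stationarity first: regressing one random walk on another
# tends to produce an impressive-looking but spurious relationship
print("ADF p-value for y:", adfuller(y)[1])   # high p-value => non-stationary

# If you skip that check and "just put regression"...
naive = sm.OLS(y, sm.add_constant(x)).fit()
print(naive.rsquared, naive.f_pvalue)   # can look deceptively "significant"
```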

So the problem is not with “artificial intelligence” or “machine learning” or “big data” or “data science” or “statistics”. It is with the people who use them incorrectly.

 

Segmentation and machine learning

For best results, use machine learning to do customer segmentation, but then get humans with domain knowledge to validate the segments

There are two common ways in which people do customer segmentation. The “traditional” method is to manually define the axes through which the customers will get segmented, and then simply look through the data to find the characteristics and size of each segment.

Then there is the “data science” way of doing it, which is to ignore all intuition, and simply use some method such as K-means clustering and “do gymnastics” with the data and find the clusters.

A quantitative extreme of this method is to do gymnastics with your data, get segments out of it, and quantitatively “take action” on it without really bothering to figure out what each cluster represents. Loosely speaking, this is how a lot of recommendation systems nowadays work – some algorithm somewhere finds people similar to you based on your behaviour, and recommends to you what they liked.

I usually prefer a sort of middle ground. I like to let the algorithms (k-means easily being my favourite) come up with the segments based on the data, and then have a bunch of humans look at the segments and make sense of them.

Basically whatever segments are thrown up by the algorithm need to be validated by human intuition. Getting counterintuitive clusters is also not a problem – on several occasions, the people I’ve had validate the clusters (usually clients) have used the counterintuitive clusters to discover bugs, gaps in the data or patterns that they didn’t know of earlier.

Also, when it comes to validation, it is always useful to get people with domain knowledge to look at the clusters. And this means that you need to be able to represent whatever clusters you’ve generated in a human-readable format. The best way of doing that is to use the cluster centres and then represent them somehow in a “physical” manner.
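
A minimal sketch of that middle ground, using scikit-learn’s k-means on a hypothetical customer table (the column names are made up for illustration): cluster on scaled data, then bring the cluster centres back to original units so that people with domain knowledge can eyeball them.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for a real customer table -- the feature names are made up
rng = np.random.default_rng(7)
customers = pd.DataFrame({
    "monthly_spend": rng.gamma(2.0, 500, 1000),
    "visits_per_month": rng.poisson(4, 1000),
    "avg_basket_size": rng.gamma(3.0, 150, 1000),
})

features = ["monthly_spend", "visits_per_month", "avg_basket_size"]
scaler = StandardScaler()
X = scaler.fit_transform(customers[features])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X)

# Bring the cluster centres back to the original units, so that humans
# with domain knowledge can look at each segment and decide whether it
# makes sense (or points to a bug or a gap in the data)
centres = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=features,
)
centres["size"] = customers["segment"].value_counts().sort_index().values
print(centres.round(1))
```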

I started writing this post some three days ago and am only getting to finish it now. Unfortunately, in the meantime I’ve forgotten the exact motivation of why I started writing this. If i recall that, I’ll maybe do another post.

Taking Intelligence For Granted

There was a point in time when the use of artificial intelligence or machine learning or any other kind of intelligence in a product was a source of competitive advantage and differentiation. Nowadays, however, many people have got so spoiled by the use of intelligence in many products they use that it has become more of a hygiene factor.

Take this morning’s post, for example. One way to look at it is that Spotify, with its customisation algorithms and recommendations, has spoiled me so much that I find Amazon’s pushing of Indian music irritating (Amazon’s approach can be called “naive customisation” – they push Indian music to me only because I’m based in India, without learning further from my preferences).

Had I not been exposed to the more intelligent customisation that Spotify offers, I might have found Amazon’s naive customisation interesting. However, Spotify’s degree of customisation has spoilt me so much that Amazon is simply inadequate.

This expectation of intelligence goes beyond product and service classes. When we get used to Spotify recommending music we like based on our preferences, we hold Netflix’s recommendation algorithm to a higher standard. We question why the Flipkart homepage is not customised to us based on our previous shopping. Or why Google Maps doesn’t learn that some of us don’t like driving through small roads when we can help it.

That customers take intelligence for granted nowadays means that businesses have to invest more in offering this intelligence. Easy-to-use data analysis and machine learning packages mean that at least some part of an industry uses intelligence in at least some form (even if they might do it badly when they fail to throw human intelligence into the mix!).

So if you are in the business of selling to end customers, keep in mind that they are used to seeing intelligence everywhere around them, and whether they state it or not, they expect it from you.

More on statistics and machine learning

I’m thinking about a client problem right now, and it struck me that something we need to predict can be modelled as a function of a few other things that we will know.

Initially I was thinking about it from the machine learning perspective, and my thought process went “this can be modelled as a function of X, Y and Z. Once this is modelled, then we can use X, Y and Z to predict this going forward”.

And then a minute later I context switched into the statistical way of thinking. And now my thinking went “I think this can be modelled as a function of X, Y and Z. Let me build a quick model to check the goodness of fit, and whether a signal actually exists”.

Now this might reflect my own biases, and my own processes for learning to do statistics and machine learning, but one important difference I find is that in statistics you are concerned about the goodness of fit, and whether there is a “signal” at all.

While in machine learning as well we look at predictive ability (area under the ROC curve and all that), there is a bit of a delay in the process between the time we model and the time we look at the goodness of fit. What this means is that sometimes we can get a bit too certain about the models that we want to build, without first asking whether they make sense and whether there is a signal in there at all.

For example, in the machine learning world, the in-sample R square of a regression hardly matters – the only thing that matters is how well you can predict out of sample. So while you’re building the regression (machine learning) model, you don’t have immediate feedback on what to include and what to exclude and whether there is a signal.
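
To illustrate the difference in reflexes on placeholder data (variable names made up): the statistical route gives you goodness of fit and significance immediately after fitting, while the machine learning route only tells you how well you predicted out of sample once the cross-validation has run.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data -- x1, x2, x3 stand in for the "X, Y and Z" above
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] + rng.normal(size=500)

# The statistical reflex: fit, and immediately ask whether there is a
# signal at all (goodness of fit, significance of each variable)
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.rsquared, ols.pvalues.round(3))

# The machine learning reflex: all that matters is out-of-sample
# prediction, which you only see after the cross-validation has run
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())
```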

I must remind you that machine learning methods are typically used when we are dealing with really high dimensional data, and where the signal usually exists in the interplay between explanatory variables rather than in a single explanatory variable. Statistics, on the other hand, is used more for low dimensional problems where each variable has reasonable predictive power by itself.

It is possibly a quirk of how the two disciplines are practiced that statistics people are inherently more sceptical about the existence of signal, and machine learning guys are more certain that their model makes sense.

What do you think?

Good vodka and bad chicken

When I studied Artificial Intelligence, back in 2002, neural networks weren’t a thing. The limited compute capacity and storage available at that point in time meant that most artificial intelligence consisted of what are called “rule-based methods”.

And as part of the course we learnt about machine translation, and the difficulty of getting the implicit meaning across. The favourite example among computer scientists at that time was the story of how some scientists translated “the spirit is willing but the flesh is weak” into Russian using English-Russian translation software, and then converted it back into English using Russian-English translation software.

The result was “the vodka is excellent but the chicken is not good”.

While this joke may not be valid any more thanks to the advances in machine translation, aided by big data and neural networks, the issue of translation is useful in other contexts.

Firstly, speaking in a language that is not your “technical first language” makes you eschew jargon. If you have been struggling to get rid of jargon from your professional vocabulary, one way to get around it is to speak more in your native language (which, if you’re Indian, is unlikely to be your technical first language). Devoid of the idioms and acronyms that you normally fill your official conversation with, you are forced to think, and this practice of talking technical stuff in a non-usual language will help you cut your jargon.

There is another use case for using non-standard languages – dealing with extremely verbose prose. A number of commentators, many of them rather well reputed, have this habit of filling their columns with flowery language, GRE words, repetition and rhetoric. While there is usually some useful content in these columns, it gets lost in the language and idioms and other things that would make the columnist’s high school English teacher happy.

I suggest that these columns be given the spirit-flesh treatment. Translate them into a non-English language, get rid of redundancies in sentences and then  translate them back into English. This process, if the translators are good at producing simple language, will remove the bluster and make the column much more readable.

Speaking in a non-standard language can also make you get out of your comfort zone and think harder. Earlier this week, I spent two hours recording a podcast in Hindi on cricket analytics. My Hindi is so bad that I usually think in Kannada or English and then translate the sentence “live” in my head. And as you can hear, I sometimes struggle for words. Anyway here is the thing. Listen to this if you can bear to hear my Hindi for over an hour.

Ticking all the boxes

Last month my Kindle gave up. It refused to take charge, only heating up the charging cable (and possibly destroying an Android charger) in the process. This wasn’t the first time this had happened.

In 2012, my first Kindle had given up a few months after I started using it, with its home button refusing to work. Amazon had sent me a new one then (I’d been amazed at the no-questions-asked customer-centric replacement process). My second Kindle (the replacement) developed problems in 2016, which I made worse by trying to pry it open with a knife. After I had sufficiently damaged it, there was no way I could ask Amazon to do anything about it.

Over the last year, I’ve discovered that I read much faster on my Kindle than in print – possibly because it allows me to read in the dark, it’s easy to hold, I can read without distractions (unlike phone/iPad) and it’s easy on the eye. I possibly take half the time to read on a Kindle what I take to read in print. Moreover, I find the note-taking and highlighting feature invaluable (I never made a habit of taking notes on physical books).

So when the Kindle stopped working I started wondering if I might have to go back to print books (there was no way I would invest in a new Kindle). Customer care confirmed that my Kindle was out of warranty, and after putting me on hold for a long time, gave me two options. I could either take a voucher that would give me 15% off on a new Kindle, or the customer care executive could “talk to the software engineers” to see if they could send me a replacement (but there was no guarantee).

Since I had no plans of buying a new Kindle, I decided to take a chance. The customer care executive told me he would get back to me “within 24 hours”. It took barely an hour for him to call me back, and a replacement was in my hands in 2 days.

It got me wondering what “software engineers” had to do with the decision to give me a replacement (refurbished) Kindle. I soon realised that Amazon possibly has an algorithm to determine whether to give a replacement Kindle to those whose devices have gone kaput out of warranty. I started trying to guess what such an algorithm might look like (a toy sketch of my guess follows the list of factors below).

The interesting thing is that among all the factors that I could list out based on which Amazon might make a decision to send me a new Kindle, there was not one that would suggest that I shouldn’t be given a replacement. In no particular order:

  • I have been an Amazon Prime customer for three years now
  • I buy a lot of books on the Kindle store. I suspect I’ve purchased books worth more than the cost of the Kindle in the last year.
  • I read heavily on the Kindle
  • I don’t read Kindle books on other apps (phone / iPad / computer)
  • I haven’t bought too many print books from Amazon. Most of the print books I’ve bought have been gifts (I’ve got them wrapped)
  • My Goodreads activity suggests that I don’t read much outside of what I’ve bought from the Kindle store
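
Purely for fun, here is what a toy version of such a rule might look like, built only from the factors listed above – I obviously have no idea what Amazon’s actual logic (if any) is, so every variable and threshold here is made up:

```python
# A purely speculative toy rule, built only from the factors listed
# above -- not Amazon's actual algorithm, just my guess at its shape
def should_replace_kindle(customer) -> bool:
    score = 0
    score += 2 if customer["prime_years"] >= 2 else 0
    score += 3 if customer["kindle_store_spend_last_year"] > customer["kindle_price"] else 0
    score += 2 if customer["kindle_reading_hours_per_week"] > 5 else 0
    score += 1 if not customer["reads_on_other_apps"] else 0
    # expected future ebook purchases are what the replacement really buys
    return score >= 5

me = {
    "prime_years": 3,
    "kindle_store_spend_last_year": 9000,   # made-up number
    "kindle_price": 8000,
    "kindle_reading_hours_per_week": 10,
    "reads_on_other_apps": False,
}
print(should_replace_kindle(me))   # True
```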

In hindsight, I guess I made the correct decision in letting the “software engineers” determine whether I qualified for a new Kindle. I guess Amazon figured that had they not sent me a new Kindle, they stood to lose a significant amount of low-marginal-cost sales!

I duly rewarded them with two book purchases on the Kindle store in the course of the following week!

Human, Animal and Machine Intelligence

Earlier this week I started watching this series on Netflix called “Terrorism Close Calls”. Each episode is about an instance of attempted terrorism that was foiled in the last two decades. For example, there is one episode on the plot to bomb a set of transatlantic flights from London to North America in 2006 (a consequence of which is that liquids still aren’t allowed on board flights).

So the first episode of the series involves this Afghan guy who drives all the way from Colorado to New York to place a series of bombs in the latter’s subway (metro train system). He is under surveillance through the length of his journey, and just as he is about to enter New York, he is stopped for what seems like a “routine drug check”.

As the episode explains, “a set of dogs went around his car sniffing”, but “rather than being trained to sniff drugs” (as is routine in such a stop), “these dogs had been trained to sniff explosives”.

This little snippet got me thinking about how machines are “trained” to “learn”. At the most basic level, machine learning involves showing a program a large number of “positive cases” and “negative cases”, based on which it “learns” the differences between the two, and thus how to identify the positive cases.

So if you want to build a system to identify cats in an image, you feed the machine a large number of images with cats in them, and a large(r) number of images without cats in them, each appropriately “labelled” (“cat” or “no cat”), and based on the differences, the system learns to identify cats.
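
In code, the “cat” / “no cat” setup looks something like this – a toy sketch using scikit-learn, with random numbers standing in for image features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for image features: in a real system each row would be pixel
# values or features extracted from an image; here it's synthetic data,
# purely to illustrate the positive/negative training setup
rng = np.random.default_rng(1)
cat_images = rng.normal(loc=1.0, size=(500, 20))       # labelled "cat"
no_cat_images = rng.normal(loc=-1.0, size=(1500, 20))  # labelled "no cat"

X = np.vstack([cat_images, no_cat_images])
y = np.array([1] * 500 + [0] * 1500)   # 1 = cat, 0 = no cat

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The classifier "learns" the difference between the two labelled sets...
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...and is then judged on examples it hasn't seen before
print(clf.score(X_test, y_test))
```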

Similarly, if you want to teach a system to detect cancers based on MRIs, you show it a set of MRIs that show malignant tumours, and another set of MRIs without malignant tumours, and sure enough the machine learns to distinguish between the two sets (you might have come across claims of “AI can cure cancer”. This is how it does it).

However, AI can sometimes go wrong by learning the wrong things. For example, an algorithm trained to recognise sheep started classifying grass as “sheep” (since most of the positive training samples had sheep in meadows). Another system went crazy in its labelling when an unexpected object (an elephant in a drawing room) was present in the picture.

While machines learn through lots of positive and negative examples, that is not how humans learn, as I’ve been observing as my daughter grows up. When she was very little, we got her a book with one photo each of 100 different animals. And we would sit with her every day pointing at each picture and telling her what each was.

Soon enough, she could recognise cats and dogs and elephants and tigers. All by means of being “trained on” one image of each such animal. Soon enough, she could recognise hitherto unseen pictures of cats and dogs (and elephants and tigers). And then recognise dogs (as dogs) as they passed her on the street. What absolutely astounded me was that she managed to correctly recognise a cartoon cat, when all she had seen thus far were “real cats”.

So where do animals stand, in this spectrum of human to machine learning? Do they recognise from positive examples only (like humans do)? Or do they learn from a combination of positive and negative examples (like machines)? One thing that limits the positive-only learning for animals is the limited range of their communication.

What drives my curiosity is that they get trained for specific things – that you have dogs to identify drugs and dogs to identify explosives. You don’t usually have dogs that can recognise both (specialisation is for insects, as they say – or maybe it’s for all non-human animals).

My suspicion (having never had a pet) is that the way animals learn is closer to how humans learn – based on a large number of positive examples, rather than on the difference between positive and negative examples. It is just that animals’ limited communication means that it is hard to train them for more than one thing (or maybe it is something to do with their mental bandwidth as well. I don’t know).

What do you think? Interestingly enough, there is a recent paper that talks about how many machine learning systems have “animal-like abilities” rather than coming close to human intelligence.

For millions of years, mankind lived, just like the animals.
And then something happened that unleashed the power of our imagination. We learned to talk
– Stephen Hawking, in the opening of a Roger Waters-less Pink Floyd’s Keep Talking