Code Density

As regular readers of this blog know, I use R for most of my data work. There have been a few occasions when I’ve tried to use Python, but I’ve found that I’m far less efficient with it than I am with R, and so abandoned it each time, despite the relative ease of putting Python code into production.

Now in my company, like in most companies, people use both Python and R (the team that reports to me largely uses R; everyone else largely uses Python). And while till recently I used to claim that I’m multilingual, in the sense that I can read Python code fairly competently, of late I’m not sure I am. I find it increasingly difficult to parse and read production-grade Python code.

And now, after some experiments with ChatGPT, and after exploring other people’s code, I have an idea of why I’m finding it hard to read production-grade Python code. It has to do with “code density”.

Of late I’ve been experimenting with Spark (finally, in this job I do a lot of “big data” work – something I never had to in my consulting career prior to this). Related to this, I was reading someone’s PySpark code.

And then it hit me – the problem (rather, my problem) with Python is that it is far more verbose than R. The number of characters or lines of code required to do something in Python is far greater than what you need in R (especially if you are using the tidyverse family of packages, as I do, including sparklyr for Spark).
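To illustrate what I mean by density (this is a made-up example, with made-up column names, and not code from any actual pipeline of mine): a typical tidyverse chunk packs a filter, an aggregation and a sort into a handful of lines, and with sparklyr the same verbs run against a Spark table almost unchanged. The equivalent PySpark or pandas code typically needs noticeably more lines and ceremony.

library(dplyr)

# a tiny made-up sales table, just so the pipeline below is runnable
sales <- tibble::tibble(
  store_id  = c(1, 1, 2, 2, 2),
  sale_date = as.Date("2023-02-01") + c(0, 0, 0, 1, 1),
  amount    = c(100, 250, 80, 0, 400)
)

# filter, aggregate and sort in one pipeline; with sparklyr these same verbs
# run against a Spark DataFrame almost unchanged
daily_revenue <- sales %>%
  filter(amount > 0) %>%
  group_by(store_id, sale_date) %>%
  summarise(revenue = sum(amount), orders = n(), .groups = "drop") %>%
  arrange(desc(revenue))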

Why does the density of code matter? It has to do with aesthetics, modularity and ease of understanding.

Yesterday I was writing some code that I plan to put into production. After a few hours of coding, I looked at the code and felt disgusted with myself – it was a monolithic block of code that was way too long. It might have made sense while I was writing it, but I knew that if I were to revisit it in a week or two, I wouldn’t be able to understand what the hell was happening there.

I’ve never worked as a professional software engineer, but with the amount of coding I’ve done, I’ve worked out what is a “reasonable length for a code block”. It’s like that apocryphal story of Indian public examiners for high school exams who evaluate history answers based on how long they are – “if they were to place an ordinary Reynolds 045 pen vertically on the sheet, the answer should be longer than that for the student to get five marks”.

An answer in a high school history exam needs to be longer than this. A code block or function should be shorter than this

It’s the reverse here. Approximately speaking, if you were to place a Reynolds pen vertically on the screen (at your normal font size), no function (or code block) should be longer than the pen.

This roughly corresponds to how much the eye can take in on one normal MacBook screen (I use a massive external monitor at work, and a less massive, but equally wide, one at home). If you have to keep scrolling up and down to understand the full logic, there is a higher chance you will make mistakes, and it becomes that much harder for someone else to understand the code.

Till recently (as in earlier this week) I would crib like crazy that people coding in Python made their code “too modular”. That I would have to keep switching back and forth between different functions (and classes!!) to make sense of some logic, and that this made the code hard to debug (I still think there is a limit to how modular you can make your ETL code).

Now, however (I’m writing this on a Saturday – I’m not working today), from the code density perspective, it all makes sense to me.

The advantage of R is that because the code is far denser, you can pack a far greater amount of logic into a Reynolds-pen length of code. So over the years I’ve gotten used to having this much logic presented to me in one chunk (without having to scroll or switch functions).

The relatively lower density of Python, however, means that the same amount of logic that would be one function in R is now split across several different functions. It is not that the writer of the code is “making this too modular” or “writing functions just for the heck of it”. It is just that their “mental Reynolds pen” doesn’t allow them to pack more lines into a chunk or function, and Python’s lower density means there is only so much logic that can go in there.

As part of my undergrad, I did a course on Software Engineering (and the one thing the professor told us then was that we should NOT take up software engineering as a career – “it’s a boring job”, he had said). In that, one of the things we learnt was that in conventional software services contexts, billing would happen as a (nonlinear) function of “kilo lines of code” (this was in early 2003).

Now, looking back, one thing I can say is that the rate per kilo line of R code ought to be much higher than the rate per kilo line of Python code.

Cross posted on my now-largely-dormant Art of Data Science newsletter

Recreating Tufte, and Bangalore weather

For most of my life, I pretty much haven’t understood what the point of “recreating” is. For example, in school if someone says they were going to “act out ______’s _____” I would wonder what the point of it was – that story is well known so they might as well do something more creative.

Later on in life, maybe some 12-13 years back, I discovered the joy in “retelling known stories” – since everyone knows the story you can be far more expressive in how you tell it. Still, however, just “re-creation” (not recreation) never really fascinated me. Most of the point of doing things is to do them your way, I’ve believed (and nowadays, if you think of it, most re-creating can be outsourced to a generative AI).

And then this weekend, that changed. On Saturday, I made the long-pending trip to Blossom (it helped that my daughter had a birthday party to attend nearby), and among other things, I bought Edward Tufte’s classic “The Visual Display of Quantitative Information”. I had read a pirated PDF of it a decade ago (when I was starting out in “data science”), but had always wanted the “real thing”.

And this physical copy, designed by Tufte himself, is an absolute joy to read. And I’m paying more attention to the (really beautiful) graphics. So, when I came across this chart of New York weather, I knew I had to recreate it.

A few months earlier, I had downloaded a dataset of Bangalore’s hourly temperature and rainfall since 1981 (i.e. a bit longer than my own life). The dataset ended in November 2022, but I wasn’t concerned. It is such a large and complex dataset that so far I had been unable to come up with an easy way to visualise it. So, when I saw this chart of Tufte’s, recreating it seemed like a good idea.

I spent about an hour and a half yesterday doing this. I’ve ignored the colour schemes and other “aesthetic” stuff (I just realised I’ve not included the right axis in my re-creation). But I do think I’ve got something fairly good.

My re-creation of Tufte’s New York weather map, in the context of Bangalore in 2022

2022 was an unusual weather year for Bangalore and it shows in this graph. May wasn’t as hot as usual, and there were some rather cold days. Bangalore recorded its coldest October and November days since the 90s (though as this graph shows, not a record by any means). It was overall a really wet year, constantly raining from May to November. The graph shows all of it.

Also, if you look at the “normal pattern” and the records, you see Bangalore’s unusual climate (yes, I do mean “climate” and not “weather” here). Thanks to the monsoons (and pre-monsoon rains), April is the hottest month. Summer, this year, has already started – in the afternoons it is impossible to go out now. The minimum temperatures are remarkably consistent through the year (so except early in the mornings, you pretty much NEVER need a sweater here – at least I haven’t after I moved back from London).

There is so much more I can do. I’m glad to have come across a template for analysing this data. Whenever I get the enthu (you know what this website is called) I’ll upload the code to produce this graph to GitHub or something. And when I get more enthu, I’ll make it aesthetically similar to Tufte’s graph (and include December 2022 data as well).
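In the meantime, the basic layering is simple enough to sketch. This is a minimal ggplot2 sketch with simulated stand-in data (all the numbers and column names below are made up, and this is not my actual code or the actual Bangalore dataset):

library(ggplot2)

# simulated stand-in data: one row per day, with made-up records, "normal"
# band and daily actuals, just to show the layering
set.seed(1)
doy <- 1:365
blr <- data.frame(
  day         = as.Date("2021-12-31") + doy,
  record_high = 34 + 4 * sin(2 * pi * (doy - 30) / 365),
  record_low  = 10 + 3 * sin(2 * pi * (doy - 30) / 365)
)
blr$normal_high <- blr$record_high - 3
blr$normal_low  <- blr$record_low + 3
blr$tmax <- blr$normal_high + rnorm(365)
blr$tmin <- blr$normal_low + rnorm(365)

# Tufte's layering: widest band = records, middle band = normal range,
# darkest band = what actually happened each day
ggplot(blr, aes(x = day)) +
  geom_linerange(aes(ymin = record_low, ymax = record_high), colour = "grey85") +
  geom_linerange(aes(ymin = normal_low, ymax = normal_high), colour = "grey60") +
  geom_linerange(aes(ymin = tmin, ymax = tmax), colour = "black") +
  labs(x = NULL, y = "Temperature (°C)")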

 

More on CRM

On Friday afternoon, I got a call on my phone. It was a “+91 9818… ” number, and my first instinct was that it was someone at work (my company is headquartered in Gurgaon), and I mentally prepared a “don’t you know I’m on vacation? can you call me on Monday instead” as I picked up the call.

It turned out to be Baninder Singh, founder of Savorworks Coffee. I had placed an order on his website on Thursday, and I half expected him to tell me that some of the things I had ordered were out of stock.

“Karthik, for your order of the Pi?anas, you have asked for an Aeropress grind. Are you sure of this? I’m asking you because you usually order whole beans”, Baninder said. This was a remarkably pertinent observation, and an appropriate question from a seller. I confirmed to him that this was indeed deliberate (this smaller package is to take to office along with my Aeropress Go), and thanked him for asking. He went on to point out that one of the other coffees I had ordered had very limited stocks, and I should consider stocking up on it.

Some people might find this creepy (that the seller knows exactly what you order, and notices changes in your order), but from a more conventional retail perspective, this is brilliant. It is great that the seller has accurate information on your profile, and is able to detect any anomalies and alert you before something goes wrong.

Now, Savorworks is a small business (a Delhi based independent roastery), and having ordered from them at least a dozen times, I guess I’m one of their more regular customers. So it’s easy for them to keep track and take care of me.

It is similar with small “mom-and-pop” stores. A limited, high-repeat clientele means it’s easy for them to keep track of their customers and look after them. The challenge, though, is: how do you scale it? Now, I’m by no means the only person thinking about this problem. Thousands of business people and data scientists and retailers and technology people and what not have pondered this question for over a decade now. Yet, what you find is that at scale you are simply unable to provide the sort of service you can at a small scale.

In theory it should be possible for an AI to profile customers based on their purchases, adds to carts, etc. and then provide them customised experiences. I’m sure tonnes of companies are already trying to do this. However, based on my experience I don’t think anyone is doing this well.
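A toy sketch of the sort of check Baninder did in his head (all the data and column names below are made up, and this is not any retailer’s actual system): flag an order when a customer with a strong historical preference suddenly deviates from it, and have a human follow up.

library(dplyr)

# made-up order history and a couple of new orders
orders <- tibble::tibble(
  customer_id = c(1, 1, 1, 1, 2, 2),
  grind       = c("whole", "whole", "whole", "whole", "aeropress", "whole")
)
new_orders <- tibble::tibble(customer_id = c(1, 2), grind = c("aeropress", "whole"))

# each customer's usual grind and how often they have picked it historically
usual <- orders %>%
  count(customer_id, grind, name = "n") %>%
  group_by(customer_id) %>%
  mutate(share = n / sum(n)) %>%
  slice_max(share, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(customer_id, usual_grind = grind, share)

# flag new orders that deviate from a strong (>= 80%) historical preference;
# these are the ones worth a phone call
flagged <- new_orders %>%
  inner_join(usual, by = "customer_id") %>%
  filter(share >= 0.8, grind != usual_grind)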

I might sound like a broken record here, but my sense is that this is because the people who are building the algos are not the ones thinking about the business problems. The algos exist. In theory, if I look at stuff like Stable Diffusion or ChatGPT (both of which I’ve been playing around with extensively in the last two days), algorithms for stuff like customer profiling shouldn’t be THAT hard. The issue, I suspect, is that people have not been asking the right questions of the algos.

On one hand, you could have business people looking at patterns they have divined themselves and then giving precise instructions to the data scientists on how to detect them – with the detection of these patterns hard-coded. On the other, the data scientists would have had a free hand and done some unsupervised stuff without much business context. Both approaches lead to easily predictable algos that aren’t particularly intelligent.

Now I’m thinking of this as a “dollar bill on the road” kind of problem. One instinct tells me that “a solution exists”, but my other instinct tells me that “if a solution existed, someone would have found it, given how many companies have been working on this kind of thing for so long”.

The other issue with such algos is that the more granular the prediction, the harder it is to get right. At the cohort level (hundreds of users), it should not be hard to profile. However, at the individual user level (which is where customers actually see the results of the algos) it is much harder. So maybe there are good solutions, but we haven’t yet seen them.

Maybe at some point in the near future, I’ll take another stab at solving this kind of problem. Until then, you have human intelligence and random algos.

 

Alcohol, dinner time and sleep

A couple of months back, I presented what I now realise is a piece of bad data analysis. At the outset, there is nothing special about this – I present bad data analysis all the time at work. In fact, I may even argue that as a head of Data Science and BI, I’m entitled to do this. Anyway, this is not about work.

In that piece, I had looked at some of the data I’ve been diligently collecting about myself for over a year, correlated it with the data collected through my Apple Watch, and found that on days I drank alcohol, my average sleeping heart rate was higher.

And so I had concluded that alcohol is bad for me. Then again, I’m an experimenter so I didn’t let that stop me from having alcohol altogether. In fact, if I look at my data, the frequency of having alcohol actually went up after my previous blog post, though for a very different reason.

However, having written this blog post, every time I drank, I would check my sleeping heart rate the next day. Most days it seemed “normal”. No spike due to the alcohol. I decided it merited more investigation – which I finished yesterday.

First, the anecdotal evidence – what kind of alcohol I have matters. Wine and scotch have very little impact on my sleep or heart rate (last year with my Ultrahuman patch I’d figured that they had very little impact on blood sugar as well). Beer, on the other hand, has a significant (negative) impact on heart rate (I normally don’t drink anything else).

Unfortunately, I don’t capture this data point (what kind of alcohol I drank, or how much) in my daily log. So it is impossible to analyse this scientifically.

Anecdotally, I started noticing another thing – all the big spikes I had reported in my previous blogpost on the topic were on days when I kept drinking (usually with others) and then had dinner very late. Could late dinner be the cause of my elevated heart rate? Again, in the days after my previous blogpost, I noticed that late dinners led to elevated sleeping heart rates (even if I hadn’t had alcohol that day). Looking at my nightly heart rate graph, I could see that the heart rate on these days was elevated in the early part of my sleep.

The good news is this (dinner time) is a data point I regularly capture. So when I finally got down to revisiting the analysis yesterday, I had a LOT of data to work with. I won’t go into the intricacies of the analysis (and all the negative results) here. But here are the key insights.

If I regress my resting heart rate against a binary of whether I had alcohol the previous day, I get a significant regression, with an R^2 of 6.1% (i.e. whether or not I had alcohol the previous day explains 6.1% of the variance in my sleeping heart rate). If I have had alcohol the previous day, my sleeping heart rate is higher by about 2 beats per minute on average.

Call:
lm(formula = HR ~ Alcohol, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6523 -2.6349 -0.3849  2.0314 17.5477 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  69.4849     0.3843 180.793  < 2e-16 ***
AlcoholYes    2.1674     0.6234   3.477 0.000645 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.957 on 169 degrees of freedom
Multiple R-squared:  0.06676,   Adjusted R-squared:  0.06123 
F-statistic: 12.09 on 1 and 169 DF,  p-value: 0.000645

Then I regressed my resting heart rate on dinner time (expressed in hours) alone. Again a significant regression, but with a much higher R^2 of 9.7%. So what time I have dinner explains a lot more of the variance in my resting heart rate than whether I’ve had alcohol does. And for each hour later that I have my dinner, my sleeping heart rate that night goes up by 0.8 bpm.

Call:
lm(formula = HR ~ Dinner, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6047 -2.4551 -0.0042  2.0453 16.7891 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.7719     3.5540  15.411  < 2e-16 ***
Dinner        0.8018     0.1828   4.387 2.02e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.881 on 169 degrees of freedom
Multiple R-squared:  0.1022,    Adjusted R-squared:  0.09693 
F-statistic: 19.25 on 1 and 169 DF,  p-value: 2.017e-05

Finally, for the sake of completeness, I regressed against both. The interesting thing is that the adjusted R^2s pretty much added up, giving me over 16% now (so effectively dinner time and alcohol are uncorrelated). The coefficients are pretty much the same once again.
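For the record, all three models are one-liners in R. This is a sketch, assuming the joined daily data frame is called daily_log (a placeholder name, not what my actual code uses) with the HR, Alcohol and Dinner columns shown in the outputs above:

# daily_log: one row per day, with sleeping heart rate (HR), whether I drank
# the previous day (Alcohol: Yes/No) and dinner time in hours (Dinner);
# "daily_log" is just a placeholder name here
m_alcohol <- lm(HR ~ Alcohol, data = daily_log)
m_dinner  <- lm(HR ~ Dinner,  data = daily_log)
m_both    <- lm(HR ~ Alcohol + Dinner, data = daily_log)
summary(m_both)   # the combined model, whose adjusted R^2 is the >16% mentioned above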


So the takeaway is simple – alcohol might be okay, but I should have dinner at my regular time (~6 pm). Also – if I’m going out drinking, I had better finish my dinner and then go. And no – beer won’t work – it is another dinner in itself. So stick to wine or scotch.

I must also mention the things I analysed that did not turn out significant – whether I have coffee, what time I sleep, the gap between dinner time and sleep time – none of these has any impact on my resting heart rate. All that matters is alcohol and when I have dinner.

And the last one is something I should never compromise on.


Aamir Khan and Alcohol Buddies

Over the weekend I was watching Koffee With Karan, the episode featuring Aamir Khan and Kareena Kapoor. It was one of the better episodes of the season, along with the one featuring Ranveer Singh and Alia Bhatt (I did not finish watching any of the others; they were damn boring).

The thing with Koffee With Karan is that it is highly dependent on how interesting the guests are, and not all Bollywood stars are equally interesting. Even in this episode, Kareena Kapoor came off as a bit of a bore, refusing to answer most questions, but Aamir Khan was great.

In the early part of the episode, both Kareena and Karan accused Aamir of being “boring”. “You come to a party, stand alone and just leave; you catch one or two people and hang out only with them for the whole party”, they said. And then a bit later, one of them (I now forget who – possibly Kareena) said, “when I meet you in small groups of 5-6 or fewer you talk a lot and you are such an interesting person, but why are you such a bore at parties?”

Then Aamir went on to talk about a party at Karan’s house where the music was so loud everyone had to shout to be heard. Nobody was dancing to the music. Nothing was happening. “What is the point of such a party?” he asked.

My friend Hari The Kid has this concept of “alcohol buddies”. These are basically people who you can hang out with only if at least one of you is drunk (there are some extreme cases who are so difficult to hang out with that the only way to do it is for BOTH of you to be drunk). The idea is that if both of you are sober there is nothing really to talk about and you will easily get bored. But hey, these are your friends so you need to hang out with them, and the easiest way of doing so is to convert them into alcohol buddies.

Bringing together this concept and Aamir Khan being “boring”, we can classify people into two kinds – those who are fun when drunk, and those who are fun when sober (some, I think, are both). People who prefer to have fun when drunk consider the sober sorts boring, and people who prefer to have fun sober think the “alcohol buddies” are boring.

Aamir, for example, appears to be a “have fun when sober” guy, who likes to hang out in small groups and make interesting conversation. Most of Bollywood, however, doesn’t seem to operate that way, hanging out in large groups and not really bothering about conversation.

Yesterday, my wife and I were talking, after an event, about how if you are the sort who likes to hang out in small groups and make conversation, large parties can be rather boring. The problem is that you would have just about started making a nice conversation with someone when someone else butts in (hey, this is a party, so this is allowed) and changes the topic massively, or massively brings down the interest level of the conversation. Every conversation ultimately goes down to its lowest common denominator, leaving you rather frustrated.

And if you are the type who likes large parties and alcohol buddies, small gatherings will drain you. You struggle to find things to talk about, and there are only so many people to talk to.

PS: Alcohol and good conversations are not mutually exclusive. Some of my best conversations have happened in very small groups, massively fuelled by alcohol. That said, these have largely been with people I can have great conversations with even when everyone is sober.

A day at an award function

So I got an award today. It is called “exemplary data scientist”, and was given out by the Analytics India Magazine as part of their MachineCon 2022. I didn’t really do anything to get the award, apart from existing in my current job.

I guess, having been out of the corporate world for nearly a decade, I had so far completely missed out on the awards and conferences circuit. I would see old classmates and colleagues put up pictures on LinkedIn of themselves collecting awards. I wouldn’t know what to make of it when my oldest friend told me that whenever he heard “Eye of the Tiger”, he would mentally prepare to get up and go receive an award (that’s how many he got, I think). It was a world alien to me.

Parallelly, I used to crib about how while I’m well networked in India, and especially in Bangalore, my networking within the analytics and data science community is shit. In a way, I was longing for physical events to remedy this, and would lament that the pandemic had killed those.

So I was positively surprised when, about a month ago, Analytics India Magazine wrote to me saying they wanted to give me this award, and that it would be part of this in-person conference. I knew of the magazine, so after asking around a bit about the legitimacy of such awards, and looking at who had got it the last time round, I happily accepted.

Most of the awardees were people like me – heads of analytics or data science at some company in India. And my hypothesis that my networking in the industry was shit was confirmed when I looked at the list of attendees – of the 100-odd people listed on the MachineCon website, I barely knew 5 (of whom 2 didn’t turn up at the event today).

Again, I might sound like a n00b, but conferences like today’s are classic two-sided markets (read this eminently readable paper on two-sided markets and the pricing of the same by Jean Tirole of the University of Toulouse). On one hand are the awardees – people like me and 99 others, who are incentivised to attend the event with the carrot of the award. On the other are the people who want to meet us, who will then pay to attend the event (or sponsor it; the entry fee for paid tickets to the event was a hefty $399).

It is like the “ladies’ night” that pubs have, where on particular days of the week women who go to the pub get a free drink. This attracts women, which in turn attracts men who seek to court the women. And what the pub spends on subsidising the women, it makes back in greater revenue from the men that night.

And so it was at today’s conference. I got courted by at least 10 people, trying to sell me cloud services, “AI services on the cloud”, business intelligence tools, “AI powered business intelligence tools”, recruitment services and the like. Before the conference, I had received LinkedIn requests from a few people seeking to sell me stuff at the conference. In the middle of the conference, I got a call from an organiser asking me to step out of the hall so that a sponsor could sell to me.

I held a poker face with stock replies like “I’m not the person who makes this purchasing decision” or “I prefer open source tools” or “we’re building this in house”.

With the full benefit of hindsight, the Radisson Blu in Marathahalli is a pretty good conference venue. An entire wing of the hotel’s ground floor is dedicated to events, and the AIM guys had taken over the place. While I had not attended any such event earlier, this one had all the markings of a well-funded and well-organised event.

As I entered the conference hall, the first thing that struck me was the number of people in suits. Most people were in suits (though few wore ties; and as if the conference expected people to turn up in suits, the goodie bag included a tie, a pair of cufflinks and a pocket square). And I’m just not used to that. Half the days I go to office in shorts. When I feel like wearing something more formal, I wear polo T-shirts with chinos.

My colleagues who went to the NSE last month to ring the bell to take us public all turned up in company T-shirts and jeans. And that’s precisely what I wore to the conference today, though I had recently procured a “formal uniform” (a polo T-shirt with the company logo, rather than my “usual uniform”, which is a round-neck T-shirt). I was pretty much the only person there in “uniform”. Towards the end of the day, I saw one other guy in his company shirt, but he was wearing a blazer over it!

Pretty soon I met an old acquaintance (who I hadn’t known would be at the conference). He introduced me to a friend, and we went for coffee. I was eating a cookie with the coffee, and had an insight – at conferences, you should eat with your left hand. That way, you don’t touch the food with the same hand you use to touch other people’s hands (surprisingly I couldn’t find sanitiser dispensers at the venue).

The talks, as expected, were nothing much to write about. Most were by sponsors selling their wares. The one talk that wasn’t by a sponsor was delivered by a guy who was introduced as “his great-grandfather did this; his grandfather did that; and now this guy is here to talk about the ethics of AI”. Full Challenge Gopalakrishna feels happened (though, unfortunately, the Kannada fellows I’d hung out with earlier in the day hadn’t watched the movie).

I was telling some people over lunch (which was pretty good) that talking about ethics in AI at a conference has become like worshipping Ganesha as part of any elaborate pooja. It has become the de rigueur thing to do. And so you pay obeisance to the concept and move on.

The awards function had three sections. The first section was for “users of AI” (from what I understood). The second (where I was included) was for “exemplary data scientists”. I don’t know what the third was for (my wife is ill today, so I came home early, right after I’d collected my award), except that it would be given away by former fast bowler and now match referee Javagal Srinath. Most of the people I’d hung out with through the day were in the Srinath section of the awards.

Overall it felt good. The drive to Marathahalli took only 45 minutes each way (I drove). A lot of people had travelled from other cities in India to reach the venue. I met a few new people. My networking in data science and analytics is still not great, but it is far better than it used to be. I hope to go to more such events (though we need to figure out how to do these events without the talks).

PS: Everyone who got the award in my section was made to line up for a group photo. As we posed with our awards, an organiser said “make sure all of you hold the prizes in a way that the Intel (today’s chief sponsor) logo faces the camera”. “I guess they want Intel outside”, I joked. It seemed to be well received by the people standing around me. I didn’t talk to any of them after that, though.

The “intel outside” pic. Courtesy: https://www.linkedin.com/company/analytics-india-magazine/posts/?feedView=all

 

Legacy Metrics

Yesterday (or was it the day before? I’ve lost track of time with full time WFH now) the Times of India Bangalore edition had two headlines.

One was Karnataka education minister BC Nagesh talking about deciding on school closures taluk-wise (a taluk is a sub-district). “We don’t want to take a decision for the whole state. However, in taluks where test positivity is more than 5%, we will shut schools”, he said.

That was on page one.

And then somewhere inside the newspaper, there was another article. The Indian Council of Medical Research (ICMR) had recommended that “only symptomatic patients should be tested for Covid-19”. However, for whatever reason, Karnataka had decided not to go by this recommendation, and had instead decided to ramp up testing.

These two articles are correlated, though the paper didn’t say they were.

I should remind you of one tweet, which I elaborated on a few days back:

 

The reason why Karnataka has decided to ramp up testing despite the advisory to the contrary is that changing policy at this point in time will mess with metrics. Yes, I stand by my tweet that test positivity ratio is a shit metric. However, with the government having accepted over the last two years that it is a good metric, it has become “conventional wisdom”. Everyone uses it because everyone else uses it.

And so you have policies on school shutdowns and other restrictive measures being dictated by this metric – because everyone else uses the same metric, using this “cannot be wrong”. It’s like the old adage that “nobody got fired for hiring IBM”.

ICMR’s message to cut testing of asymptomatic individuals is a laudable one – given that an overwhelming number of people infected by the incumbent Omicron variant of covid-19 have no symptoms at all. The reason it has not been accepted is that it will mess with the well-accepted metric.

If you stop testing asymptomatic people, the total number of tests will drop sharply. The people who are ill will get themselves tested anyway, and so the numerator (the number of positive reports) won’t drop by much. This means that the ratio will suddenly jump up.
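A stylised example of why the ratio jumps (the numbers below are entirely made up, purely to illustrate the mechanics):

# made-up numbers, purely to illustrate the mechanics
tests <- 10000; positives <- 500
positives / tests                   # 5% positivity

# now drop, say, 6000 asymptomatic tests that contributed only 50 positives
(positives - 50) / (tests - 6000)   # 11.25% - same epidemic, much higher ratio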

And that needs new thresholds – while 5% is some sort of “critical number” now (much as it is with p-values), the critical number will become something else. Moreover, if only symptomatic people are to be tested, the number of tests per day will vary even more – and so the positivity ratio may not be as stable as it is now.

All kinds of currently carefully curated metrics will get messed up. And that is a big problem for everyone who uses these metrics. And so there will be pushback.

Over a period of time, I expect the government and its departments to come up with alternative metrics (like how banks have now come up with alternatives to LIBOR), after which the policy to cut testing of asymptomatic people will get implemented. Until then, we should bow to the “legacy metric”.

And if you haven’t figured it out already, legacy metrics are everywhere. You might be the cleverest data scientist going around, and you might come up with what you think is a totally stellar metric. However, irrespective of how stellar it is, the fact that people have to change their way of thinking and their processes to adopt it means that it won’t get much acceptance.

The strategy I’ve settled on is to either change the metric slowly, in stages (little by little), or to publish the new metric alongside the old one. Depending on how good the new metric is, one of the two will die away over time.

Metrics

Over the weekend, I wrote this on twitter:

 

Surprisingly (at the time of writing this at least), I haven’t got that much abuse for this tweet, considering how “test positivity” has been held as the gold standard in terms of tracking the pandemic by governments and commentators.

The reason why I say this is a “shit metric” is simple – it doesn’t give that much information. Let’s think about it.

For a (ratio) metric to make sense, both the numerator and the denominator need to be clearly defined, and there needs to be clear information content in the ratio. In this particular case, both are clearly defined – the denominator is the number of people who got tested for Covid-19, and the numerator is the number of those people who returned a positive test.

So far so good. Apart from being an objective measure, the test positivity ratio is also a ratio, and thus normalised (unlike the absolute number of positive tests).

So why do I say it doesn’t give much information? It comes down to what actually goes into the ratio.

The problem with test positivity ratio is the composition of the denominator (now we’re getting into complicated territory). Essentially, there are many reasons why people get tested for Covid-19. The most obvious reason to get tested is that you are ill. Then, you might get tested when a family member is ill. You might get tested because your employer mandates random tests. You might get tested because you have to travel somewhere and the airline requires it. And so on and so forth.

Now, for each of these reasons for getting tested, we can define a sort of “prior probability of testing positive” (based on historical averages, etc). And the positivity ratio needs to be seen in relation to this prior probability. For example, in “peaceful times” (eg. Bangalore between August and November 2021), a large proportion of the tests would be “random” – people travelling or employer-mandated. And this would necessarily mean a low test positivity.

The other extreme is when the disease is spreading rapidly – few people are travelling or going physically to work. Most of the people who get tested are getting tested because they are ill. And so the test positivity ratio will be rather high.
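To make the composition point concrete (the prior probabilities and mixes below are entirely made up): the measured ratio is just a weighted average of each group’s positivity, weighted by who happens to be getting tested, so it moves when the mix moves even if the underlying prevalence doesn’t.

# made-up priors: probability of testing positive, by reason for testing
priors <- c(symptomatic = 0.30, contact = 0.15, travel = 0.01, employer = 0.01)

# "peaceful times": tests are mostly travel- or employer-mandated
mix_calm <- c(symptomatic = 0.10, contact = 0.05, travel = 0.50, employer = 0.35)
sum(priors * mix_calm)   # about 4.6% positivity

# mid-wave: mostly symptomatic people and their contacts get tested
mix_wave <- c(symptomatic = 0.60, contact = 0.25, travel = 0.10, employer = 0.05)
sum(priors * mix_wave)   # about 22% positivity, driven largely by who shows up to test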

Basically – rather than simply telling you how bad the covid situation in a region is, the ratio is itself influenced by how bad the covid situation is (because that determines who gets tested). You can think of it as some sort of Schrödinger-ian measurement.

That wasn’t an offhand comment. Because government policy is an important input into test positivity ratio. For example, take “contact tracing”, where contacts of people who have tested positive are hunted down and also tested. The prior probability of a contact of a covid patient testing positive is far higher than the prior probability of a random person testing positive.

And so, as and when the government steps up contact tracing (as it does in the early days of each new wave), the test positivity ratio goes up, as more “high prior probability” people get tested. Similarly, whether other states require a negative test to travel affects the positivity ratio – the more likely it is that you need a test to travel, the more likely that “low prior probability” people will take the test, and the lower the ratio will be. Or when governments decide to “randomly test” people (pulling them off the streets or whatever), the ratio will come down.

In other words – the ratio can be easily gamed by governments, apart from just being influenced by government policy.

So what do we do now? How do we know whether the Covid-19 situation is serious enough to merit clamping down on people’s liberties? If test positivity ratio is a “shit metric” what can be a better one?

In this particular case (writing this on 3rd Jan 2022), absolute number of positive cases is as bad a metric as test positivity – over the last 3 months, the number of tests conducted in Bangalore has been rather steady. Moreover, the theory so far has been that Omicron is far less deadly than earlier versions of Covid-19, and the vaccination rate is rather high in Bangalore.

While defining metrics, sometimes it is useful to go back to first principles, and think about why we need the metric in the first place and what we are trying to optimise. In this particular case, we are trying to see when it makes sense to cut down economic activity to prevent the spread of the disease.

And why do we need lockdowns? To prevent hospitals from getting overwhelmed. You might remember the chaos of April-May 2021, when it was near impossible to get a hospital bed in Bangalore (even crematoriums had long queues). This is a situation we need to avoid – and the only one that merits lockdowns.

One simple measure we can use is to see how many hospital beds are actually occupied by covid patients, and whether that might become a problem soon. Basically – if you can measure something “close to the problem”, measure it and use that as the metric, rather than using proxies such as test positivity.

Because test positivity depends on too many factors, including government action. Because we are dealing with a new variant here, which is supposedly less severe. Because most of us have been vaccinated now, our response to getting the disease will be different. The change in situation means the old metrics don’t work.

It’s interesting that the Mumbai municipal corporation has started including bed availability in its daily reports.

Modelling for accuracy

Recently I’ve been remembering the first assignment of my “quantitative methods 2” course at IIMB back in 2004. In the first part of that course, we were learning regression. And so this assignment involved a regression problem. Not too hard at first sight – maybe 3 explanatory variables.

We had been randomly divided into teams of four. I remember working on it in the Computer Centre, in close proximity to some other teams. I remember trying to “do gymnastics” – combining variables, transforming them, all in the hope of getting the “best possible R square”. From what I remember, most of the groups went “R square hunting” that day. The assignment had been cleverly chosen such that, for an academic exercise, the R square wasn’t very high.

As an aside – one thing a lot of people take a long time to come to terms with is that in “real life” (industry problems) R squares aren’t usually that high. Forecast accuracy isn’t that high. And that the elegant methods they had learnt back in school / academia may not be as elegant any more in industry. I think I’ve written about this, but I can’t find the link now.

Anyway, back to QM2. I remember the professor telling us that three groups would be chosen at random on the day of the assignment submission, and from each of these three groups one person would be chosen at random who would have to present the group’s solution to the class. I remember that the other three people in my group all decided to bunk class that day! In any case, our group wasn’t called to present.

The whole point of this massive build up is – our approach (and the approach of most other groups) had been all wrong. We had just gone in a mad hunt for R square, not bothering to figure out whether the wild transformations and combinations that we were making made any business sense. Moreover, in our mad hunt for R square, we had all forgotten to consider whether a particular variable was significant, and if the regression itself was significant.
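In R terms, the lesson is to read all of summary() – the coefficients, their significance, and the F-statistic of the regression itself – and then ask whether the variables make business sense, rather than fixating on one number. A generic sketch (y, x1, x2, x3 and df are placeholder names, not the assignment’s actual variables):

model <- lm(y ~ x1 + x2 + x3, data = df)   # y, x1..x3 and df are placeholders
summary(model)   # check the coefficient signs and p-values, the F-statistic
                 # and the adjusted R-squared - not just Multiple R-squared
# and then the non-statistical check: do x1, x2, x3 (and any transformations
# of them) actually make business sense?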

What we learnt was that while R square matters, it is not everything. The “model needs to be good”. The variables need to make sense. In statistics you can’t just go about optimising for one metric – there are several others. And this lesson has stuck with me. And guides how I approach all kinds of data modelling work. And I realise that is in conflict with the way data science is widely practiced nowadays.

The way data science is largely practiced in the wild nowadays is precisely a mad hunt for R square (or the area under the ROC curve, if you’re doing a classification problem). Whether the variables used make sense doesn’t matter. Whether the transformations are sound doesn’t matter. It doesn’t matter at all whether the model is “good”, or appropriate – the only measure of the goodness of the model seems to be the R square!

In a way, contests such as Kaggle have exacerbated this trend. In contests, typically, there is a precise metric (such as R Square) that you are supposed to maximise. With contests being evaluated algorithmically, it is difficult to evaluate on multiple parameters – especially not whether “the model is good”. And since nowadays a lot of data scientists hone their skills by participating in contests such as on Kaggle, they are tuned to simply go R square hunting.

Also, the big difference between Kaggle and real life is that in Kaggle, the model that you build doesn’t really matter – it is just whatever combination of variables and methods gets you the best score. You get the best R square. You win. You take the prize. You go home.

You don’t need to worry about how the data for the model was collected. The model doesn’t have to be implemented. No business decisions need to be made based on the model. Contest done, model done.

Obviously that is not how things work in real life. Building the model is only one in a long series of steps in solving the business problem. And when you focus too much on just one thing – the model’s accuracy on the data you have been given – a lot can be lost in the rest of the chain (including the application of the model in future situations).

And in this way, by focussing on just a small portion of the entire data science process (model building), I think Kaggle (and other similar competition platforms) has actually done a massive disservice to data science itself.

Tailpiece

This is completely unrelated to the rest of the post, but too small to merit a post of its own.

Suppose you ask a software engineer to sort a few datasets. He goes about applying bubble sort, heap sort, quick sort, insertion sort and a whole host of other techniques. And then picks the one that sorted the given datasets fastest.

That’s precisely how it seems “data science” is practiced nowadays.

Junior Data Scientists

Since this is a work related post, I need to emphasise that all opinions in this are my own, and don’t reflect that of any organisation / organisations I might be affiliated with

The last-released episode of my Data Chatter podcast is with Abdul Majed Raja, a data scientist at Atlassian. We mostly spoke about R and Python, the two programming languages / packages most used for data science, and spoke about their relative merits and demerits.

While we mostly spoke about R and Python, Abdul’s most insightful comment, in my opinion, had to do with neither. While talking about online tutorials and training, he spoke about how most tutorials related to data science are aimed at the entry level, for people wanting to become data scientists, and that there was very little readymade material to help people become better data scientists.

And from my vantage point, as someone who has been trying hard to recruit data scientists through the course of this year, this is spot on. A lot of the assignment submissions I see (most candidates who apply to my team are put through an open-ended assignment) seem uncorrelated with the stated years of experience on the candidates’ CVs. Essentially, a lot of them just appear “very junior”.

This “juniority”, in most cases, comes through in the way people have done their assignments. A telltale sign, for example, is an excessive focus on necessary but nowhere near sufficient things such as data cleaning, variable transformation, etc. Another is the mechanical application of methods without bothering to explain why a particular method was chosen in the first place.

Apart from the lack of tutorials, one reason why the quality of data science profiles continues to remain “junior” could be the organisation of teams themselves. To become better at your job, you need to interact with people who are better than you at your job. Unfortunately, the rapid rise in demand for data scientists over the last decade has meant that this peer learning is not always there.

Yes – if you are a bunch of data scientists working together, you can pull each other up. However, if many of you have come in through the same process, it is that much more difficult – there is no benchmark for you.

The other thing is the structure of the teams (I’m saying this with very little data, so call me out if I’m bullshitting) – unlike software engineers, data scientists seldom work in large teams. Sometimes they are scattered across the organisation, largely working with tech or business teams. In any case, companies don’t need that many data scientists. So the number is low to start off with as well.

Another reason is the structure of the market – for the last decade the demand for data scientists has far exceeded the available supply. So that has meant that there is no real reason to upskill – you’ll get a job anyway.

Abdul’s solution, in the absence of tutorials, is for data scientists to look at other people’s code. The R community, for example, has a weekly Tidy Tuesday data challenge, and a lot of people who take that challenge put up their code online. I’m pretty certain similar resources exist for Python (on Kaggle, if not anywhere else).

So for someone who wants to see how other data scientists work and learn from them, there are plenty of resources around.

PS: I want to record a podcast episode on the “pile stirring” epidemic in machine learning (where people simply throw methods at a dataset without really understanding why that should work, or understanding the basic math of different methods). So far I’ve been unable to find a suitable guest. Recommendations welcome.