The Science in Data Science

The science in “data science” basically represents the “scientific method”.

It’s a decade since the phrase “data scientist” got coined, though if you go on LinkedIn, you will find people who claim to have more than two decades of experience in the subject.

The origins of the phrase itself are unclear, though some sources claim that it came out of this HBR article in 2012 written by Thomas Davenport and DJ Patil (though, in 2009, Hal Varian, then Google’s Chief Economist, had said that the “sexiest job of the 21st century” would be that of a statistician).

Some of you might recall that in 2018, I had said that “I’m not a data scientist any more”. That was mostly down to my experience working with companies in London, where I found that data science was used as a euphemism for “machine learning” – something I was incredibly uncomfortable with.

With the benefit of hindsight, it seems like I was wrong. My view on data science being a euphemism for machine learning came from interacting with small samples of people (though it could be an English quirk). As I’ve dug around over the years, it seems like the “science” in data science comes not from the maths in machine learning, but elsewhere.

One phenomenon that had always intrigued me was the number of people with PhDs, especially NOT in maths, computer science or statistics, who have made a career in data science. Initially I put it down to “the gap between PhD and tenure track faculty positions in science”. However, the numbers kept growing.

The more perceptive of you might know that I run a podcast now. It is called “Data Chatter”, and is ten episodes old now. The basic aim of the podcast is for me to have some interesting conversations – and then release them for public benefit. Yeah, yeah.

So, there was this thing that intrigued me, and I have a podcast. I did what you would have expected me to do – get on a guest who went from a science background to data science. I got Dhanya, my classmate from school, to talk about how her background with a PhD in neuroscience has helped her become a better data scientist.

It is a fascinating conversation, and served its primary purpose of making me understand what the “science” in data science really is. I had gone into the conversation expecting to talk about some machine learning, and how that gets used in academia or whatever. Instead, we spoke for an hour about designing experiments, collecting data and testing hypotheses.

The science in “data science” basically represents the “scientific method”. What Dhanya told me (you should listen to the conversation) is that a PhD prepares you for thinking in the scientific method, and drills into you years of practice in it. And this is especially true of “experimental” PhDs.

And then, last night, while preparing the notes for the podcast release, I stumbled upon the original HBR article by Thomas Davenport and DJ Patil talking about “data science”. And I found that they talk about the scientific method as well. And I found that I had talked about it in my newsletter as well – only to forget it later. This is what I had written:

Reading Patil and Davenport’s article carefully suggests, however, that companies might be making a deliberate attempt at recruiting pure science PhDs for data scientist roles.

The following excerpts from the article (which possibly shaped the way many organisations think about data science) can help us understand why PhDs are sought after as data scientists.

  • Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time (Ed: the article was published in late 2012, so we’re almost “five years later” now)
  • Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results.
  • Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology.
  • It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path

Patil and Davenport make it very clear that traditional “data analysts” may not make for great data scientists.

We learn, and we forget, and we re-learn. But learning is precisely what the scientific method, which underpins the “science” in data science, is all about. And it is definitely NOT about machine learning.

The Fragile Charioteer

A few days back, I was thinking of an interesting counterfactual in the Mahabharata. As most people know, the story goes that Arjuna went to battle with his charioteer Krishna, and got jitters looking at all his relatives and elders on the other side, and almost lost the will to fight.

And then Krishna recited to him the Bhagavad Gita, which inspired Arjuna to get back to battle, and with Krishna’s expert charioteering (and occasional advice), Arjuna led the Pandavas to (an ultimately pyrrhic) victory in the war.

A long time back I had introduced my blog readers to the “army of monkeys” framework. In that I had contrasted the war in Ramayana (a seemingly straightforward war fought against a foreign king who had kidnapped the hero’s wife) to the war in the Mahabharata (a more complex war fought between cousins).

Given that the Ramayana war was largely straightforward, with the only trickery being in the form of special weapons, going to war with an army of monkeys was a logical choice. Generals on both sides apart, the army of monkeys helped defeat the Lankan army, and the war (and Sita) was won.

The Mahabharata war was more complex, with lots of “mental trickery” (one of which almost led Arjuna to quit the war) and deception from both sides. While LOTS of soldiers died (the story goes that almost all the Kshatriyas in India died in the war), the war was ultimately won in the mind.

In that sense, the Pandavas’ decision to choose a clever but non-combatant Krishna rather than his entire army (which fought on the side of the Kauravas) turned out to be prescient.

When I wrote the original post on this topic, I was a consultant, and had gotten mildly annoyed at a prospective client deciding to engage an army rather than my trickery for a problem they were facing. Now, I’m part of a company, and I’m recruiting heavily for my team, and I sometimes look at this question from the other side.

One advantage of an uncorrelated army of monkeys is that not all of them will run away together. Yes, some might run away from time to time, but you keep getting new monkeys, and on a consistent basis you have an army.

On the other hand, if you decide to go with a “clever charioteer”, you run the risk that the charioteer might choose to run away one day. And the problem with clever charioteers is that no two of them are alike, and if one runs away, he is not easy to replace (you might have to buy a new chariot to suit the new charioteer).
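To put rough numbers on the contrast above, here is a minimal sketch. It assumes every person leaves independently with the same probability in a given period, which is a simplification, and the figures (a team of 20, a 20% chance of leaving) and the function names are made up purely for illustration.

```python
def prob_all_leave(n_people: int, p_leave: float) -> float:
    """Chance that all n people leave in the same period, assuming independence."""
    return p_leave ** n_people


def expected_remaining(n_people: int, p_leave: float) -> float:
    """Expected number of people still around after one period."""
    return n_people * (1 - p_leave)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    p = 0.2
    print("lone charioteer gone:", prob_all_leave(1, p))
    print("entire army of 20 gone:", prob_all_leave(20, p))
    print("monkeys expected to remain:", expected_remaining(20, p))
```

With these made-up numbers, the lone charioteer leaves you with nothing one time in five, while the whole army vanishing at once is vanishingly unlikely. The sketch says nothing about how much each monkey or charioteer is worth while they stay, which is the other half of the trade-off.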

Maybe that’s one reason why some companies choose to hire armies of monkeys rather than charioteers?

Then again, I think it depends upon the problem at hand. If the “war” (set of business problems) to be fought is more or less straightforward, an army of monkeys is a superior choice. However, if you are defining the terrain rather than just navigating it, a clever charioteer, however short-lived he might be, might just be a superior choice.

It was this thought of fleeing charioteers that made me think of the counterfactual with which I began this post. What do you think about this?

PS: I had thought about this post a month or two back, but it is only today that I’m actually getting down to writing it. It is strictly a coincidence that today also happens to be Sri Krishna Janmashtami.

Enjoy your chakli!

Topography of Bangalore

My day on Twitter didn’t start out too well today. I wrote this:

As I’ve stayed on for longer, with more data, things have improved today. I’ve learnt a few things, had a few conversations, and watched some fights. But so far, my day has been made by this article about Bangalore’s topography and development.

I’m halfway through reading it, so can’t say yet if I can agree with its conclusions. But what I really really like about the article is the maps. The main map they have is a topographical map of Bangalore (unfortunately, it focusses on the cantonment area, so my areas are left out), and they then zoom in on bits of it to explore development.

Topography of Bangalore, from the India Forum article

So many insights already from this:

  1. There is a clear correlation between areas that are perceived to be “posh” and elevation. The better planned areas of Bangalore are built on higher ground than the worse planned.
  2. “High grounds” lives up to its name
  3. While the article (so far) is mainly about construction of the cantonment, the preference for high areas post independence is also evident. From the bottom of the map seen above, you can broadly identify the northern boundary of the area that is now Jayanagar and Basavanagudi. Similarly, the Vidhana Soudha is built at pretty much the highest part of Bangalore (before the Metro came up, you could see the Vidhana Soudha by standing on top of the Trinity Church spire)

Later on in the article there is a more zoomed-out map of Bangalore. And that confirms that Jayanagar is indeed on lofty land.

Jayanagar is right at the bottom of this image. It’s interesting that parts of Banashankari (a rather hilly area) are actually low-lying

As the article progresses, it goes off into the (not unexpected) caste and class conflict territory. In any case, I’ve got my value from it. These maps are absolutely fascinating! I hope you like them as well.

Friends

This is a story written by my daughter, who is now 4 5/6 years old. She typed this up on this computer, so I’m just copy pasting things here. 

I find something weirdly magical about this story. No, it doesn’t only have to do with the fact that it was written by my daughter. The format of the story makes it seem like there’s some weird literary quality about it. So it makes sense to share this with the wider world.

Read and enjoy. 

 

ones simon says hi

ylou says hi

vrala mogyvoshy  says hi

vlala sintti says hi

vgurule hn says  hi

ltha says hi

mugda says hi

kartik says hi

pinky says hi

adya says hi

jeje says hi

grulmnikan says hi

fivesix says hi

tykrs says hi

cheche says hi

cunti says hi

rats says hi

ujis says hi

xeon says hi

oganesson says hi

de end

You might be wondering who the character in the second last line is. You can find it on the periodic table 🙂 (and “xeon” is a typo. She says she meant to write “xenon”)

Goldilocks and Barbells

Most children learn the story of Goldilocks and the Three Bears. Goldilocks finds the bears’ home, and tries out random things there. Pretty much for everything she tries, there will be three versions (each belonging to one of the bears), with one being <too extreme>, the second being <too extreme at the other end> and the third being “just right”.

The basic message can be summarised as “extremes bad, means good”. In fact, even if you didn’t learn the story as a child (I didn’t), the message of “doing everything in moderation” gets impressed upon you from various quarters. “Don’t eat too much, don’t eat too little, eat in moderation” is possibly the most prominent example of this.

And in some way we have all internalised this message. That both too much and too little of everything is bad, and it’s the middle path that is the right one.

And then on the other side, a concept that has always existed but was formally articulated only fairly recently, is the “barbell”. First articulated by Nassim Nicholas Taleb as an investment strategy, it talks about investing in a combination of extremes and eschewing the means. In Taleb’s original case, it was about an investment strategy that is a mix of low-risk bonds and high-risk (long) out-of-the-money options, that together give a low-risk winning portfolio in the long run. This ran contrary to “modern portfolio theory” that tries to get a mix of assets that maximise expected returns and minimise standard deviation (note I’m saying standard deviation and not “risk” – they’re not the same).
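The shape of a barbell payoff can be sketched in a few lines. This is only an illustration under stated assumptions: the 90/10 split, the 3% bond return and the 20x option payoff are invented numbers, not figures from Taleb, and `barbell_value` is a hypothetical helper.

```python
def barbell_value(capital: float, safe_fraction: float,
                  bond_return: float, option_multiple: float) -> float:
    """End-of-period value of a bonds-plus-options barbell.

    option_multiple is what each unit put into options turns into:
    0.0 if they expire worthless, 20.0 if they pay off twenty-fold.
    """
    safe_leg = capital * safe_fraction * (1 + bond_return)
    risky_leg = capital * (1 - safe_fraction) * option_multiple
    return safe_leg + risky_leg


if __name__ == "__main__":
    # Hypothetical numbers: 90% in bonds yielding 3%, 10% in options.
    worst = barbell_value(100, 0.9, 0.03, 0.0)   # options expire worthless
    best = barbell_value(100, 0.9, 0.03, 20.0)   # options pay off 20x
    print(f"options expire worthless: {worst:.1f}")  # 92.7
    print(f"options pay off 20x: {best:.1f}")        # 292.7
```

Under these assumptions the worst case is a known, capped loss (about 7% of capital), while the upside is many times larger. The outcomes are widely dispersed, so the standard deviation looks high even though the maximum loss is small.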

And this strategy applies pretty much everywhere in life. There are a lot of things where the only way you can benefit is by “being all in”. Doing things in moderation can actually be hurtful, and combinations that have a “little bit of everything” can be inferior to a simple superposition of extremes.

My breakfast is a barbell, for example. I either skip it completely (nearly zero calories from black coffee only), or have a big breakfast with at least two eggs. A light breakfast completely messes up my day.

My exercise is a barbell (no pun intended). I either lift heavy weights (attached to a barbell) or do nothing. Exercises with light weights make me feel miserable.

In my nearly eight month long return to corporate life, I haven’t taken many days off. My philosophy there is that if I take off, I should be able to completely take off (no “one email here”), and have done so only when it’s easy to do so.

You can think of corporate strategy and a company’s focus as being a barbell.

The list goes on. The point is – life is full of barbells, or we can make the most of life by using barbell strategies. Do either this extreme or that extreme, but don’t get confused and do something in the middle.

The problem, however, is that we get brought up on goldilocks, not barbells. And think that the middle path is superior to the extremes. It isn’t always so.

Losing My Religion

In terms of religion, I had a bit of a strange upbringing. My father was a rationalist, bordering on atheist. My mother was insanely religious, even following a godman. And no – I never once saw them fight about this.

Both of them tried to impress me with their own religions. My mother tried to inculcate in me the habit of praying every morning, and looking for strange patterns (“if this flower on this photo falls, then it will be a good day” types). My father would refute most of these things saying “how can you be a student of science and still believe this stuff?”. I suppose I consumed a lot of coffy bite when I was a kid.

In any case, with a combination of influences, both internal and external, in my early youth I was this strange concoction of “not religious but superstitious”. I had both a “lucky shirt” and a “lucky pen”. Back in class 12, I had convinced myself that “Wednesdays are a particularly bad day for me”.

I really don’t know if this has anything to do with my upbringing, but I would see patterns everywhere. I would draw correlations between random unconnected things, and assume causality. I staunchly refused to admit that I was religious, but allowed for strange patterns and correlations nevertheless.

When I had five minor car accidents during the course of 2007 (it wasn’t a great year for me, and I was quite messed up), I believed (or maybe was made to believe) that it was “my car’s way of protecting me” (I wasn’t hurt in any of those, though the car took a lot of beatings and scratchings). I had come to believe that a particular job didn’t go well because on the first day of work, I had splashed water on a kid on my way back by driving fast through a puddle.

The general discourse nowadays is that religion improves people’s mental health. That it helps people see meaning and purpose in their lives, and live through tragedies and other kinds of unhappiness. A common discourse on the right, on social media, is that it is the lack of religion that has led to the mental health epidemic that we have been going through for a while.

The way I see it, based on my own experience, this is completely backward. The basic thing about religion, at least based on my mixed upbringing, is “random correlations”. A lot of religion can be explained as “you do this, God will be happy with you and give you that”. Or that something was just “meant to be”, maybe based on actions in one’s past lives.

Religion is about “being a good person” and “karma”, and that all your mistakes will necessarily get punished, if not in this life in the next. The long period over which karma operates significantly increases the scope of random correlations that you can draw from life.

First of all I’m good at pattern recognition (something that has immensely helped me in my academics and careers). The downside of being good at pattern recognition is that there can be LOTS of false positives in patterns that you recognise. And when you recognise patterns that don’t really exist, you learn the wrong things, and after that live life the wrong way. And I think that was happening to me for a very very long time.

And so came the lucky shirts, the lucky pens, the precise order in which I would check websites at work every morning and many other things that were actually damaging to life, especially mental health. The pattern recognition was making me miserable, and the religion and superstition that I had come to believe in gave credence to these patterns, and (with the benefit of hindsight) made me more miserable.
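The false positive problem described above is easy to reproduce. The sketch below is an illustration rather than anything rigorous: it generates series of pure random noise, correlates every pair, and counts how many pairs look “strongly” correlated. The series count, length, threshold and seed are all arbitrary choices, and `count_spurious` is a made-up helper name.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def count_spurious(n_series=40, length=10, threshold=0.6, seed=0):
    """Count pairs of unrelated noise series whose |correlation| >= threshold."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(n_series)]
    pairs = list(combinations(series, 2))
    strong = sum(1 for a, b in pairs if abs(pearson(a, b)) >= threshold)
    return strong, len(pairs)


if __name__ == "__main__":
    strong, total = count_spurious()
    print(f"{strong} of {total} pairs of pure noise look strongly correlated")
```

None of these series has anything to do with any other, yet with short series and enough pairs to compare, some of them will clear the threshold by chance alone. Look at enough days, shirts and pens, and a few “patterns” are bound to show up.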

In 2012, after having burnt out for the third time in six years, I began to see a psychiatrist and take antidepressants. It was the same time when I had started my “portfolio life”, and one of the items in that portfolio was volunteering with the Takshashila Institution, where I was asked to teach a class on logical fallacies.

That’s possibly a funny trigger, but hours of lecturing about “correlation not implying causation” meant that I started finally seeing the random correlations that I had formed in my own head. And one by one, I started dismantling them. There were no lucky days any more. There wasn’t that much karma any more. I started feeling less worried about things I wanted to say. I started realising that being “good” is good for its own merits, and not because some karma recommends that you should be good.

And I started feeling happier. Over the course of time, it seemed like a big load had been taken off my head. And so, whenever I see discourse on social media (and in books) that religion makes people happier, I fail to understand it.

In January 2014, I met an old friend for dinner. While walking back to the parking lot, he casually asked me what my views on religion were. I thought for a minute and said, “well, I firmly believe that correlation does not imply causation. And this means I can’t be religious”. That’s when I became convinced that I had lost my religion, and had become happier for it. And I continue to be happy because I’m not religious.

The Misfit Job Market

Exactly 15 years ago, I was looking for a job. I had graduated from IIMB four months earlier, taken my first ever full time job three months earlier, and was already serving notice. Very early on, I had figured that I was not a good fit for the job that I had taken up, and so decided to cut my losses and move on.

The only problem was job hunting was hard. Back then, most people I spoke to seemed suspicious of me because I was getting out of my first job so early. For the longest time (years later), people spoke to me as if there was something wrong with me because I had quit my first job within three months. Finally I ended up taking a 20% pay cut to take another job where I seemed a better fit.

Thinking back, I don’t think I’m alone. The sheer randomness of the campus placement process means that a lot of people end up in jobs that they are ill suited for, purely based on a bit of bad judgment here and a lucky interview there. And most smart people figure out quickly enough that in case they are in jobs they are not a good fit for, it’s better to cut losses and move on. If it is their first ever job (applies for undergrad jobs, and for MBAs without prior work experience), the desperation to get out of their misfit jobs will be high.

I think this is a highly underserved market. Companies fall over themselves to access premium slots in the random process called campus placements, without realising that a significant part of the same pool will (theoretically) be available for a proper interview just a few months hence.

5-6 years back, an old friend of mine had started a company which was essentially a clearinghouse targeted at this precise market – to enable companies to hire people in their first years of employment. Unfortunately the company didn’t take off, suggesting that the market design problem is not easy to solve.

Anyway, in case you are a just-graduated student who believes you are a misfit in your first job, and instead want to do analytics, get in touch with me. Having been on the other side, I’m more than happy to fish in this pool, and I know that I’ll get some temporarily undervalued talent here.

Just that I don’t know what sort of market or clearinghouse I need to go to to tap this supply, and so I’m putting out a bid here in the form of this blogpost.

PS: In case you’re a recent reader of my blog, I’ve written a book on market design.

Wokes and Jokes

Q: How do you know a woke is losing an argument?
A: They start talking about privilege.

No, this is not a post that seeks to make jokes about wokes. Instead, here, I seek to explore what kind of jokes wokes like, assuming there are jokes they like, that is.

A long time back, I had written here that the problem with the woke movement is that it denies people their jokes. Because jokes are inherently at the expense of someone (a person or group of people or thing), and because extreme political correctness means that making fun of a person or group of people is not polite, political correctness means a lot of jokes go out of the window.

Think of all the jokes that you enjoyed when you were in high school – it is likely that you won’t be able to put most of those jokes on social media nowadays – since it’s not kosher to make fun of the people / groups of people they make fun of.

And so, one day recently, I started thinking if wokes laugh at all – if making fun of people or groups of people is not done, how do they get their laughs? And then I realised that if you look at standup comedians, there are a bunch of them who can be broadly described as “woke” (as per today’s standards – I have NO CLUE how well this will hold up). So what gives? How can wokes have their jokes when most of our old jokes are not valid any more?

The interesting thing about the woke movement is that they largely depend on group identities. One <insert oppressed community (on whatever axis)> person gets beaten, it is seen as an act of violence against the community. Everything is spoken in group terms. The individual’s individuality doesn’t matter. Everything is analysed in group terms.

Except for the jokes.

Wokes get their jokes because they target particular people. And identification of such people is rather easy. Start with choosing a politician (or politicians) who is definitely anti-woke (Modi, Trump, Johnson, Bolsonaro, Orban – at the time of writing). And then build a social network around them, of people who hang out with them, agree with them, retweet them, get retweeted by them, and so on. All of them are worth making fun of.

If you make a joke about Modi, you are NOT making a joke about Gujaratis. If you make a joke about Trump, you are NOT making a joke about builders, or blondes. And these jokes are kosher because the targets of the jokes are reviled, or are strongly associated with the reviled.

And a person’s status on whether they can be made fun of or not depends on their associations. You cross the proverbial political floor, you can suddenly gain indemnity or get exposed to being made fun of, depending upon the direction in which you’ve crossed the floor.

I’ve never really been a fan of standup comedy (I think it has a rather low “bit rate”). But this possibly explains why I find it even less tolerable nowadays – most of the jokes are political, and it gets boring after a while.

Then again, as the wokes say, everything is political.

Ranga and Big Data

There are some meeting stories that are worth retelling and retelling. Sometimes you think they should be included in some movie (or at least a TV show). And you never tire of telling the stories.

The way I met Ranga can qualify as one such story. At the outset, there was nothing special about it – both of us had joined IIT Madras at the same time, to do a B.Tech. in Computer Science. But the first conversation itself was epic, and something worth telling again and again.

During our orientation, one of the planned events was “a visit to the facilities”, where a professor would take us around to see the library, the workshops, a few prominent labs and other things.

I remember that the gathering point for Computer Science students was right behind the Central Lecture Theatre. This was the second day of orientation and I’d already met a few classmates by then. And that’s where I found Ranga.

The conversation went somewhat like this:

“Hi I’m Karthik. I’m from Bangalore”.
“Hi I’m Ranga. I’m from Madras. What are your hobbies?”
“I play the violin, I play chess…. ”
“Oh, you play chess? Me too. Why don’t we play a blindfold game right now?”
“Er. What? What do you want to do? Now?”
“Yeah. Let’s start. e4”.
(I finally managed to gather my senses) “c5”

And so we played for the next two hours. I clearly remember playing a Sicilian Dragon. It was a hard fought game until we ended up in an endgame with opposite coloured bishops. Coincidentally, by that time the tour of the facilities had ended. And we called it a draw.

We kept playing through our B.Techs., mostly blindfold in the backbenches of classrooms. Most of the time I would get soundly thrashed. One time I remember going from our class, with the half-played game in our heads, setting it up on a board in Ranga’s room, and continuing to play.

In any case, chess apart, we’ve also had a lot of nice conversations over the last 21 years. Ranga runs a big data and AI company called TheDataTeam, so I thought it would be good to record one of our conversations and share it with the world.

And so I present to you the second episode of my new “Data Chatter” podcast. Ranga and I talk about all things “big data”, data architectures, warehousing, data engineering and all that.

As usual, the podcast is available on all podcasting platforms (though, curiously, each episode takes much longer to appear on Google Podcasts after it has released. So this second episode is already there on Spotify, Apple Podcasts, CastBox, etc. but not on Google yet).

Give it a listen. Share it with whoever you think might like it. Subscribe to my podcast. And let me know what you think of it.

Podcast: All Reals

I had spoken here a few times about starting a new “data podcast”, right? The first episode is out today, and in this I speak to S Anand, cofounder and CEO of Gramener, about the interface of business with data science.

It’s a long freewheeling conversation, where we talk about data science in general, about Excel, about data visualisations, pie charts, Tufte and all that.

Do listen – it should be available on all podcast platforms, and let me know what you think. Oh, and don’t forget to subscribe to the podcast. New episodes will be out every Tuesday morning.

And if you think you want to be on the podcast, or know someone who wants to be a guest on the podcast, you can reach out. datachatterpodcast AT gmail.