Stable Diffusion and Chat GPT and Logistic Regression

For a long time I have had this shibboleth for telling whether someone is a “statistics person” or a “machine learning person”. It is based on what they call a regression where the dependent variable is binary. Machine learning people call it “logistic regression”; statisticians simply call it “logit” (there is also a “probit”).

Now, in terms of implementation as well, there is one big difference between the way “logit” is modelled and the way “logistic regression” is. For a logit model (if you are using Python, you need the “statsmodels” package for this, not scikit-learn), the number of observations needs to far exceed the number of independent variables.

Otherwise, a matrix that needs to be inverted as part of the solution turns out to be singular, and there is no solution. I guess I betrayed my greater background in statistics than in machine learning when, in 2018, I wrote this blogpost on machine learning being a “process to tie down coefficients in maths models”.
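
To make that concrete, here is a minimal sketch (simulated data; the exact error or warning can differ by statsmodels version and by the random draw): a logit with more variables than observations typically dies with a singular-matrix error.

```python
# Minimal sketch: a statsmodels logit with more variables than observations.
# Simulated data purely for illustration; the exact failure mode can vary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n_obs, n_vars = 50, 100          # far fewer observations than variables
X = rng.normal(size=(n_obs, n_vars))
y = (rng.random(n_obs) > 0.5).astype(int)

try:
    model = sm.Logit(y, sm.add_constant(X))
    result = model.fit()         # Newton-type solver needs to invert a Hessian
    print(result.summary())
except Exception as e:           # typically a singular-matrix LinAlgError
    print("Logit failed:", e)
```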

“Logistic regression” (as opposed to “logit”), on the other hand, puts no such constraint on the regression matrix being invertible. Instead of actually inverting the matrix, machine learning approaches simply learn the coefficients iteratively using gradient descent (basically the opposite of hill climbing), so mathematical inconveniences such as matrices that cannot be inverted are moot there.
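
A matching sketch on the machine learning side (again simulated data; scikit-learn’s SGDClassifier does literal stochastic gradient descent on the logistic loss, while the default LogisticRegression uses other iterative solvers – either way, nothing gets inverted):

```python
# Minimal sketch: logistic regression fit by (stochastic) gradient descent,
# happily accepting more variables than observations. Simulated data again.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
n_obs, n_vars = 50, 100
X = rng.normal(size=(n_obs, n_vars))
y = (X[:, 0] + rng.normal(scale=0.5, size=n_obs) > 0).astype(int)

# loss="log_loss" gives logistic regression (use loss="log" on older scikit-learn)
clf = SGDClassifier(loss="log_loss", penalty="l2", max_iter=2000, tol=1e-4)
clf.fit(X, y)                    # no matrix inversion anywhere in this path
print("training accuracy:", clf.score(X, y))
```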

And so you have logistic regression models with thousands of variables, often calibrated with fewer data points than there are variables. To be honest, I can’t understand this fully – without sufficient information (data points) to calibrate the coefficients, there will always be a sense of randomness in the output. The model has too many degrees of freedom, and so there is additional information the model is supplying (apart from what was supplied in the training data!).

Of late I have been playing a fair bit with generative AI (primarily ChatGPT and Stable Diffusion). The other day, my daughter and I were alone in my in-laws’ house, and I told her, “look, I’ve brought my personal laptop along; if you want we can play with it”. She demanded that she “play with Stable Diffusion”. This is the image she got for “tiger chasing deer”.

I have written earlier here about how the likes of ChatGPT and Stable Diffusion in a way redefine “information content”.

 

And if you think about it, almost by definition, “generative AI” creates information (and hallucinates, like in the above pic). Traditionally, “a picture is worth a thousand words”, but if you can generate a picture with a prompt of just a few words, the information content in it is far less than a thousand words.

In some sense, this reminds me of “logistic regression” once again. By definition (because it is generative), there is insufficient “tying down of coefficients”, because of which the AI inevitably ends up “adding value of its own”, which by definition is random.

So, you will end up getting arbitrary results. ChatGPT often gives you wrong answers to questions. Dall-E and Midjourney and Stable Diffusion will return nonsense images such as the above. Because a “generative AI” needs to create information, by definition, all the coefficients of the model cannot be well calibrated. 

And the consequence of this is that however good these AIs get, however much data is used to train them, there will always be an element of randomness to them. There will always be test cases where they give funny results.

No, AGI is not here yet.

Chat GPT and Information Content

For the last week or so I’ve been fooling around with Chat GPT. It’s been a lot of fun, if you can endure the verbosity of its answers. Yesterday, at work, I used it for tech support when some people I was talking to were using too much jargon.

Before I came home from work yesterday, my last conversation was with a couple of recruiters, about a role that I’m hiring for. So when I got home, on a whim, I wondered whether Chat GPT could write job descriptions for me.

I asked:

write a job description for a senior data scientist at India’s largest logistics company. 4-7 years exp. Needs to have experience in ML, OR and statistics, and be able to work close to business and on ambiguous problems

It answered (pasting in full since the screenshot above is limited):

Job Title: Senior Data Scientist

Company: India’s Largest Logistics Company

Location: India

Experience: 4-7 years

We are seeking an experienced and highly skilled Senior Data Scientist to join our team at India’s largest logistics company. In this role, you will be responsible for using machine learning, operations research, and statistics to drive business insights and decision making.

As a Senior Data Scientist, you will have a deep understanding of data science techniques and technologies, and will be able to apply them to complex and ambiguous problems. You will be able to work closely with business teams to understand their needs and to develop solutions that drive value for the company.

To be successful in this role, you should have at least 4-7 years of experience in data science, and should have a strong background in machine learning, operations research, and statistics. You should be able to work independently and be comfortable working on complex and ambiguous problems.

Responsibilities:

  • Use machine learning, operations research, and statistics to drive business insights and decision making
  • Work closely with business teams to understand their needs and develop solutions
  • Develop and maintain data pipelines and systems
  • Collaborate with other data scientists and cross-functional teams to develop and implement solutions

Qualifications:

  • Bachelor’s or Master’s degree in a quantitative field, such as computer science, statistics, or engineering
  • At least 4-7 years of experience in data science
  • Strong background in machine learning, operations research, and statistics
  • Ability to work independently and on complex and ambiguous problems
  • Excellent communication and collaboration skills
  • Experience with Python, R, or other programming languages is a plus

We offer a competitive salary and benefits package, as well as opportunities for growth and advancement. If you are an experienced data scientist with a passion for driving business insights and decision making, we encourage you to apply.

In the past when I’ve been hiring, I’ve been a bit annoyed by people asking for job descriptions. I’ve largely found them to be fluff and boilerplate, and I’ve tried to get away with a sentence or two. However, due to the insistence of recruiters and candidates alike, I’ve ended up writing elaborate job descriptions. I don’t need to any more.

This post, however, is not about hiring (but yeah – I am hiring, and the above job description is reasonably accurate). It’s about information content in general.

Traditionally, information content has been measured in bits. A bit is the amount of information contained in the outcome of an equi-probable binary event.
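
To illustrate the definition (a toy sketch, nothing more): the information in an event of probability p is -log2(p) bits, so an equi-probable binary event carries exactly one bit.

```python
# Toy illustration of information measured in bits: -log2(p) for an event of
# probability p. A fair coin flip carries exactly 1 bit.
from math import log2

def information_bits(p: float) -> float:
    """Shannon self-information of an event with probability p."""
    return -log2(p)

print(information_bits(0.5))    # 1.0 bit  (equi-probable binary event)
print(information_bits(0.25))   # 2.0 bits (one of four equally likely outcomes)
print(information_bits(0.999))  # ~0.0014 bits (an almost-certain event tells you little)
```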

Sometimes when we find that someone is too verbose and using too many words when fewer would suffice, we say that their bit rate is low. We also use “low bit rate” to describe people such as former Prime Minister Atal Behari Vajpayee, who would speak incredibly slowly.

However, beyond the bit, which is a fairly technical concept, it has been difficult to quantify information content. Sometimes you read an article or a story and find that there is nothing much to it. But given natural language, and the context of various words, it is impossible to quantify the information content.

Now, with Chat GPT, maybe it becomes a bit easier (though one would need a “reverse chat GPT algo”, to find the set of prompts required for Chat GPT to churn out a particular essay). Above, for example, I’ve shown how much fluff there generally is to the average job description – a fairly short prompt generated this longish description that is fairly accurate.

So you can define the information content of a piece or essay in terms of the number of words in the minimum set of prompts required for Chat GPT (or something like it) to come up with it. If you are a boring, stereotypical writer, the set of prompts required will be small. If you are highly idiosyncratic, then you will need to give a larger number of prompts for Chat GPT to write like you. You know where I’m going with this.
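
There is no “reverse Chat GPT algo” yet, but a crude classical stand-in for the same idea is to see how well a piece of text compresses – boilerplate squeezes down a lot more than idiosyncratic writing. A rough sketch (the sample strings here are made up for illustration):

```python
# Crude, classical proxy for "information content": how small does the text get
# when compressed? Boilerplate-heavy text compresses much better than
# idiosyncratic text. The sample strings are made up purely for illustration.
import zlib

def compression_ratio(text: str) -> float:
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

boilerplate = ("We are seeking an experienced and highly skilled professional "
               "to join our team. The ideal candidate will be responsible for "
               "driving business insights and decision making. ") * 5
idiosyncratic = ("Full Challenge Gopalakrishna feels happened, though the "
                 "Kannada fellows I'd hung out with hadn't watched the movie.")

print("boilerplate  :", round(compression_ratio(boilerplate), 2))
print("idiosyncratic:", round(compression_ratio(idiosyncratic), 2))
```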

This evening, in office, a colleague commented that now it will be rather easy to generate marketing material. “Even blogs might become dead, since with a few prompts you can get that content”, he said (building a service off the Chat GPT API that takes a tweet and converts it into an essay could be a legit business).

I didn’t tell him then but I have decided to take it up as a challenge. I consider myself to be a fairly idiosyncratic writer, which means I THINK there is a fair bit of information content in what I write, and so this blog will stay relevant. Let’s see how it goes.

PS: I still want to train a GAN on my blog (well over a million words, at last count) and see how it goes. If you know of any tools I can use for this, let me know!

 

A day at an award function

So I got an award today. It is called “exemplary data scientist”, and was given out by the Analytics India Magazine as part of their MachineCon 2022. I didn’t really do anything to get the award, apart from existing in my current job.

I guess having been out of the corporate world for nearly a decade, I had so far completely missed out on the awards and conferences circuit. I would see old classmates and colleagues put pictures on LinkedIn collecting awards. I wouldn’t know what to make of it when my oldest friend would tell me that whenever he heard “eye of the tiger”, he would mentally prepare to get up and go receive an award (he got that many awards, I think). It was a world alien to me.

Parallelly, I used to crib about how while I’m well networked in India, and especially in Bangalore, my networking within the analytics and data science community is shit. In a way, I was longing for physical events to remedy this, and would lament that the pandemic had killed those.

So I was positively surprised when about a month ago Analytics India Magazine wrote to me saying they wanted to give me this award, and it would be part of this in-person conference. I knew of the magazine, so after asking around a bit on legitimacy of such awards and looking at who had got it the last time round, I happily accepted.

Most of the awardees were people like me – heads of analytics or data science at some company in India. And my hypothesis that my networking in the industry was shit was confirmed when I looked at the list of attendees – of the 100-odd people listed on the MachineCon website, I barely knew 5 (of whom 2 didn’t turn up at the event today).

Again I might sound like a n00b, but conferences like today’s are classic two-sided markets (read this eminently readable paper on two-sided markets and the pricing of the same by Jean Tirole of the University of Toulouse). On the one hand are awardees – people like me and 99 others, who are incentivised to attend the event with the carrot of the award. On the other hand are people who want to meet us, who will then pay to attend the event (or sponsor it; the entry fee for paid tickets to the event was a hefty $399).

It is like the “ladies’ night” that pubs have, where on particular days of the week, women who go to the pub get a free drink. This attracts women, which in turn attracts men who seek to court the women. And what the pub spends in subsidising the women it makes back in terms of greater revenue from the men on the night.

And so it was at today’s conference. I got courted by at least 10 people, trying to sell me cloud services, “AI services on the cloud”, business intelligence tools, “AI powered business intelligence tools”, recruitment services and the like. Before the conference, I had received LinkedIn requests from a few people seeking to sell me stuff at the conference. In the middle of the conference, I got a call from an organiser asking me to step out of the hall so that a sponsor could sell to me.

I held a poker face with stock replies like “I’m not the person who makes this purchasing decision” or “I prefer open source tools” or “we’re building this in house”.

With the full benefit of hindsight, Radisson Blu in Marathahalli is a pretty good conference venue. An entire wing of the ground floor of the hotel is dedicated to events, and the AIM guys had taken over the place. While I had not attended any such event earlier, it had all the markings of a well-funded and well-organised event.

As I entered the conference hall, the first thing that struck me was the number of people in suits. Most people were in suits (though few wore ties; and, as if the conference expected people to turn up in suits, the goodie bag included a tie, a pair of cufflinks and a pocket square). And I’m just not used to that. Half the days I go to office in shorts. When I feel like wearing something more formal, I wear polo T-shirts with chinos.

My colleagues who went to the NSE last month to ring the bell to take us public all turned up in company T-shirts and jeans. And that’s precisely what I wore to the conference today, though I had recently procured a “formal uniform” (polo T-shirt with company logo, rather than my “usual uniform”, which is a round-neck T-shirt). I was pretty much the only person there in “uniform”. Towards the end of the day, I saw one other guy in his company shirt, but he was wearing a blazer over it!

Pretty soon I met an old acquaintance (who I hadn’t known would be at the conference). He introduced me to a friend, and we went for coffee. I was eating a cookie with the coffee, and had an insight – at conferences, you should eat with your left hand. That way, you don’t touch the food with the same hand you use to touch other people’s hands (surprisingly I couldn’t find sanitiser dispensers at the venue).

The talks, as expected, were nothing much to write about. Most were by sponsors selling their wares. The one talk that wasn’t by a sponsor was delivered by a guy who was introduced as “his great-grandfather did this. His grandfather did that. And now this guy is here to talk about the ethics of AI”. Full Challenge Gopalakrishna feels happened (though, unfortunately, the Kannada fellows I’d hung out with earlier that day hadn’t watched the movie).

I was telling some people over lunch (which was pretty good) that talking about ethics in AI at a conference has become like worshipping Ganesha as part of any elaborate pooja. It has become the de rigueur thing to do. And so you pay obeisance to the concept and move on.

The awards function had three sections. The first section was for “users of AI” (from what I understood). The second (where I was included) was for “exemplary data scientists”. I don’t know what the third was for (my wife is ill today so I came home early as soon as I’d collected my award), except that it would be given by fast bowler and match referee Javagal Srinath. Most of the people I’d hung out with through the day were in the Srinath section of the awards.

Overall it felt good. The drive to Marathahalli took only 45 minutes each way (I drove). A lot of people had travelled from other cities in India to reach the venue. I met a few new people. My networking in data science and analytics is still not great, but far better than it used to be. I hope to go for more such events (though we need to figure out how to do these events without the talks).

PS: Everyone who got the award in my section was made to line up for a group photo. As we posed with our awards, an organiser said “make sure all of you hold the prizes in a way that the Intel (today’s chief sponsor) logo faces the camera”. “I guess they want Intel outside”, I joked. It seemed to be well received by the people standing around me. I didn’t talk to any of them after that, though.

The “intel outside” pic. Courtesy: https://www.linkedin.com/company/analytics-india-magazine/posts/?feedView=all

 

Structures of professions and returns to experience

I’ve written here a few times about the concept of “returns to experience”. Basically, in some fields such as finance, the “returns to experience” is rather high. Irrespective of what you have studied or where, how long you have continuously been in the industry and what you have been doing has a bigger impact on your performance than your way of thinking or education.

In other domains, returns to experience is far less. After a few years in the profession, you would have learnt all you had to, and working longer in the job will not necessarily make you better at it. And so the average person with 15 years of experience is not that much better than the average person with 10 years of experience, and salaries stagnate as careers progress.

While I have spoken about returns to experience, till date, I hadn’t bothered to figure out why returns to experience is a thing in some, and only some, professions. And then I came across this tweetstorm that seeks to explain it.

Now, normally I have a policy of not reading tweetstorms longer than six tweets, but here it was well worth it.

It draws upon a concept called “cognitive flexibility theory”.

Basically, there are two kinds of professions – well-structured and ill-structured. To quickly summarise the tweetstorm, well-structured professions have the same problems again and again, and there are clear patterns. And in these professions, first principles are good to reason out most things, and solve most problems. And so the way you learn it is by learning concepts and theories and solving a few problems.

In ill-structured domains (eg. business or medicine), the concepts are largely the same but the way they manifest in different cases is vastly different. As a consequence, just knowing the theories or fundamentals is not sufficient to understand most cases, each of which is idiosyncratic.

Instead, study in these professions comes from “studying cases”. Business and medicine schools are classic examples of this. The idea with solving lots of cases is NOT that you will see the same patterns in a new case, but that having seen lots of cases, you might be able to reason out HOW to approach a new case that comes your way (and the way you approach it is very likely novel).

Picking up from the tweetstorm once again:

 

It is not hard to see that when the problems are ill-structured or “wicked”, the more the cases you have seen in your life, the better placed you are to attack the problem. Naturally, assuming you continue to learn from each incremental case you see, the returns to experience in such professions is high.

In securities trading, for example, the market takes very many forms, and irrespective of what chartists will tell you, patterns seldom repeat. The concepts are the same, however. Hence, you treat each new trade as a “case” and try to learn from it. So returns to experience are high. And so when I tried to reenter the industry after 5 years away, I found it incredibly hard.

Chess, on the other hand, is well-structured. Yes, alpha zero might come and go, but a lot of the general principles simply remain.

Having read this tweetstorm, gobbled a large glass of wine and written this blogpost (so far), I’ve been thinking about my own profession – data science. My sense is that data science is an ill-structured profession where most practitioners pretend it is well-structured. And this is possibly because a significant proportion of practitioners come from academia.

I keep telling people about my first brush with what can now be called data science – I was asked to build a model to forecast demand for air cargo (2006-7). The said demand being both intermittent (one order every few days for a particular flight) and lumpy (a single order could fill up a flight, for example), it was an incredibly wicked problem.

Having had a rather unique career path in this “industry” I have, over the years, been exposed to a large number of unique “cases”. In 2012, I’d set about trying to identify patterns so that I could “productise” some of my work, but the ill-structured nature of problems I was taking up meant this simply wasn’t forthcoming. And I realise (after having read the above-linked tweetstorm) that I continue to learn from cases, and that I’m a much better data scientist than I was a year back, and much much better than I was two years back.

On the other hand, because data science attracts a lot of people from pure science and engineering (classically well-structured fields), you see a lot of people trying to apply overly academic or textbook approaches to problems that they see. As they try to divine problem patterns that don’t really exist, they fail to recognise novel “cases”. And so they don’t really learn from their experience.

Maybe this is why I keep saying that “in data science, years of experience and competence are not correlated”. However, fundamentally, that ought NOT to be the case.

This is also perhaps why a lot of data scientists, irrespective of their years of experience, continue to remain “junior” in their thinking.

PS: The last few paragraphs apply equally well to quantitative finance and economics as well. They are ill-structured professions that some practitioners (thanks to well-structured backgrounds) assume are well-structured.

Modelling for accuracy

Recently I’ve been remembering the first assignment of my “quantitative methods 2” course at IIMB back in 2004. In the first part of that course, we were learning regression. And so this assignment involved a regression problem. Not too hard at first sight – maybe 3 explanatory variables.

We had been randomly divided into teams of four. I remember working on it in the Computer Centre, in close proximity to some other teams. I remember trying to “do gymnastics” – combining variables, transforming them, all in the hope of trying to get the “best possible R square”. From what I remember, most of the groups went “R square hunting” that day. The assignment had been cleverly chosen such that for an academic exercise, the R Square wasn’t very high.

As an aside – one thing a lot of people take a long time to come to terms with is that in “real life” (industry problems) R squares aren’t usually that high. Forecast accuracy isn’t that high. And that the elegant methods they had learnt back in school / academia may not be as elegant any more in industry. I think I’ve written about this, but I can’t find the link now.

Anyway, back to QM2. I remember the professor telling us that three groups would be chosen at random on the day of the assignment submission, and from each of these three groups one person would be chosen at random who would have to present the group’s solution to the class. I remember that the other three people in my group all decided to bunk class that day! In any case, our group wasn’t called to present.

The whole point of this massive build-up is – our approach (and the approach of most other groups) had been all wrong. We had just gone on a mad hunt for R square, not bothering to figure out whether the wild transformations and combinations that we were making made any business sense. Moreover, in our mad hunt for R square, we had all forgotten to consider whether each variable was significant, and whether the regression itself was significant.

What we learnt was that while R square matters, it is not everything. The “model needs to be good”. The variables need to make sense. In statistics you can’t just go about optimising for one metric – there are several others. And this lesson has stuck with me. And guides how I approach all kinds of data modelling work. And I realise that is in conflict with the way data science is widely practiced nowadays.
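
To make the lesson concrete, here is a small simulation (made-up data, not the original QM2 assignment): throwing junk variables into a regression mechanically pushes up the R square, but the adjusted R square and the significance of the junk coefficients give the game away.

```python
# Made-up illustration of "R square hunting": adding pure-noise variables always
# nudges R square up, but adjusted R square and p-values expose the junk.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 40
x_real = rng.normal(size=n)
y = 2.0 * x_real + rng.normal(scale=1.0, size=n)   # one genuine relationship

# Honest model: just the real variable
honest = sm.OLS(y, sm.add_constant(x_real)).fit()

# "Gymnastics" model: the real variable plus 15 columns of pure noise
junk = rng.normal(size=(n, 15))
X_big = sm.add_constant(np.column_stack([x_real, junk]))
hunted = sm.OLS(y, X_big).fit()

print("honest R2:", round(honest.rsquared, 3), " adj:", round(honest.rsquared_adj, 3))
print("hunted R2:", round(hunted.rsquared, 3), " adj:", round(hunted.rsquared_adj, 3))
# Most of the junk coefficients come out insignificant, as they should
print("share of junk coefficients with p > 0.05:", (hunted.pvalues[2:] > 0.05).mean())
```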

The way data science is largely practiced in the wild nowadays is precisely a mad hunt for R Square (or area under ROC curve, if you’re doing a classification problem). Whether the variables used make sense doesn’t matter. Whether the transformations are sound doesn’t matter. It doesn’t matter at all whether the model is “good”, or appropriate – the only measure of goodness of the model seems to be the R square!

In a way, contests such as Kaggle have exacerbated this trend. In contests, typically, there is a precise metric (such as R Square) that you are supposed to maximise. With contests being evaluated algorithmically, it is difficult to evaluate on multiple parameters – especially not whether “the model is good”. And since nowadays a lot of data scientists hone their skills by participating in contests such as on Kaggle, they are tuned to simply go R square hunting.

Also, the big difference between Kaggle and real life is that in Kaggle, the model that you build doesn’t matter. It’s just a combination. You get the best R square. You win. You take the prize. You go home.

You don’t need to worry about how the data for the model was collected. The model doesn’t have to be implemented. No business decisions need to be made based on the model. Contest done, model done.

Obviously that is not how things work in real life. Building the model is only one in a long series of steps in solving the business problem. And when you focus too much on just one thing – the model’s accuracy on the data that you have been given – a lot can be lost in the rest of the chain (including application of the model in future situations).

And in this way, by focussing on just a small portion of the entire data science process (model building), I think Kaggle (and other similar competition platforms) has actually done a massive disservice to data science itself.

Tailpiece

This is completely unrelated to the rest of the post, but too small to merit a post of its own.

Suppose you ask a software engineer to sort a few datasets. He goes about applying bubble sort, heap sort, quick sort, insertion sort and a whole host of other techniques. And then picks the one that sorted the given datasets fastest.

That’s precisely how it seems “data science” is practiced nowadays.
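
If you want the analogy in code, here is a toy version (made-up data, naive implementations): run a few sorts on the given datasets and blindly crown whichever happened to run fastest.

```python
# Toy version of the analogy: run several sorts on the given datasets and pick
# whichever happened to be fastest, without asking which is appropriate and why.
import random
import time

def bubble_sort(a):
    a = list(a)
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

def insertion_sort(a):
    a = list(a)
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def builtin_sort(a):
    return sorted(a)

datasets = [[random.random() for _ in range(500)] for _ in range(3)]
timings = {}
for sorter in (bubble_sort, insertion_sort, builtin_sort):
    start = time.perf_counter()
    for d in datasets:
        sorter(d)
    timings[sorter.__name__] = time.perf_counter() - start

print("winner:", min(timings, key=timings.get))   # picked purely by the metric
```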

Junior Data Scientists

Since this is a work-related post, I need to emphasise that all opinions in this are my own, and don’t reflect those of any organisation / organisations I might be affiliated with.

The last-released episode of my Data Chatter podcast is with Abdul Majed Raja, a data scientist at Atlassian. We mostly spoke about R and Python, the two programming languages / packages most used for data science, and spoke about their relative merits and demerits.

While we mostly spoke about R and Python, Abdul’s most insightful comment, in my opinion, had to do with neither. While talking about online tutorials and training, he spoke about how most tutorials related to data science are aimed at the entry level, for people wanting to become data scientists, and that there was very little readymade material to help people become better data scientists.

And from my vantage point, as someone who has been heavily trying to recruit data scientists through the course of this year, this is spot on. The quality of a lot of the profiles I get (most candidates who apply to my team get put through an open-ended assignment) seems uncorrelated with the stated years of experience on their CVs. Essentially, a lot of them just appear “very junior”.

This “juniority”, in most cases, comes through in the way that people have done their assignments. A telltale sign, for example, is an excessive focus on necessary but nowhere near sufficient things such as data cleaning, variable transformation, etc. Another telltale sign is the simple application of methods without bothering to explain why the method was chosen in the first place.

Apart from the lack of tutorials around, one reason why the quality of data science profiles continues to remain “junior” could be the organisation of teams themselves. To become better at your job, you need to interact with people who are better than you at your job. Unfortunately, the rapid rise in demand for data scientists in the last decade has meant that this peer learning is not always there.

Yes – if you are a bunch of data scientists working together, you can pull each other up. However, if many of you have come in through the same process, it is that much more difficult – there is no benchmark for you.

The other thing is the structure of the teams (I’m saying this with very little data, so call me out if I’m bullshitting) – unlike software engineers, data scientists seldom work in large teams. Sometimes they are scattered across the organisation, largely working with tech or business teams. In any case, companies don’t need that many data scientists. So the number is low to start off with as well.

Another reason is the structure of the market – for the last decade the demand for data scientists has far exceeded the available supply. So that has meant that there is no real reason to upskill – you’ll get a job anyway.

Abdul’s solution, in the absence of tutorials, is for data scientists to look at other people’s code. The R community, for example, has a weekly Tidy Tuesday data challenge, and a lot of people who take that challenge put up their code online. I’m pretty certain similar resources exist for Python (on Kaggle, if not anywhere else).

So for someone who wants to see how other data scientists work and learn from them, there are plenty of resources around.

PS: I want to record a podcast episode on the “pile stirring” epidemic in machine learning (where people simply throw methods at a dataset without really understanding why that should work, or understanding the basic math of different methods). So far I’ve been unable to find a suitable guest. Recommendations welcome.

The Science in Data Science

The science in “data science” basically represents the “scientific method”.

It’s a decade since the phrase “data scientist” got coined, though if you go on LinkedIn, you will find people who claim to have more than a decade of experience in the subject.

The origins of the phrase itself are unclear, though some sources claim that it came out of this HBR article written by Thomas Davenport and DJ Patil in 2012 (though, in 2009, Hal Varian, formerly Google’s Chief Economist, had said that the “sexiest job in the next ten years” would be that of a statistician).

Some of you might recall that in 2018, I had said that “I’m not a data scientist any more“. That was mostly down to my experience working with companies in London, where I found that data science was used as a euphemism for “machine learning” – something I was incredibly uncomfortable with.

With the benefit of hindsight, it seems like I was wrong. My view on data science being a euphemism for machine learning came from interacting with small samples of people (though it could be an English quirk). As I’ve dug around over the years, it seems like the “science” in data science comes not from the maths in machine learning, but elsewhere.

One phenomenon that had always intrigued me was the number of people with PhDs, especially NOT in maths, computer science or statistics, who have made a career in data science. Initially I put it down to “the gap between PhD and tenure-track faculty positions in science”. However, the numbers kept growing.

The more perceptive of you might know that I run a podcast now. It is called “Data Chatter”, and is ten episodes old now. The basic aim of the podcast is for me to have some interesting conversations – and then release them for public benefit. Yeah, yeah.

So, there was this thing that intrigued me, and I have a podcast. I did what you would have expected me to do – get on a guest who went from a science background to data science. I got Dhanya, my classmate from school, to talk about how her background with a PhD in neuroscience has helped her become a better data scientist.

It is a fascinating conversation, and served its primary purpose of making me understand what the “science” in data science really is. I had gone into the conversation expecting to talk about some machine learning, and how that gets used in academia or whatever. Instead, we spoke for an hour about designing experiments, collecting data and testing hypotheses.

The science in “data science” basically represents the “scientific method”. What Dhanya told me (you should listen to the conversation) is that a PhD prepares you for thinking in the scientific method, and drills into you years of practice in it. And this is especially true of “experimental” PhDs.

And then, last night, while preparing the notes for the podcast release, I stumbled upon the original HBR article by Thomas Davenport and DJ Patil talking about “data science”. And I found that they talk about the scientific method as well. And I found that I had talked about it in my newsletter as well – only to forget it later. This is what I had written:

Reading Patil and Davenport’s article carefully suggests, however, that companies might be making a deliberate attempt at recruiting pure science PhDs for data scientist roles.

The following excerpts from the article (which possibly shaped the way many organisations think about data science) can help us understand why PhDs are sought after as data scientists.

  • Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time (Ed: the article was published in late 2012, so we’re almost “five years later” now)
  • Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results.
  • Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology.
  • It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path

Patil and Davenport make it very clear that traditional “data analysts” may not make for great data scientists.

We learn, and we forget, and we re-learn. But learning is precisely what the scientific method, which underpins the “science” in data science, is all about. And it is definitely NOT about machine learning.

How Python swallowed R

A week ago, I put up a post on LinkedIn saying that if someone else working in analytics / data science primarily uses R for their work, I would like to chat.

I got two responses, one of which was from a guy who strictly isn’t in analytics / data science, but needs to analyse large amounts of data for his work. I had a long chat with the other guy today.

Yesterday I put the same post on Twitter, and have got a few more responses from there. However, it is staggering. An overwhelming majority of data people who I know work in Python. One of the reasons I put these posts was to assure myself that I’m not alone in using R, though the response so far hasn’t given me too much of an assurance.

So why do most companies end up using Python for analytics, even when R is clearly better for things like data wrangling, reporting, visualisation, dashboarding, etc.? I have a few theories on this, and I think all of them come together to result in Python having its “overwhelming market share” (at least among people I know).

Tech people clearly prefer python since it’s easier to integrate. So the tech leaders request the data science leaders to use Python, since it is much easier for the tech people. In a lot of organisations, data science reports into tech, so this request is honoured.

Even if it isn’t, if you recall, “data scientists” are generally tech facing rather than business facing. This means that the models they build need to be codified, and added to the company’s code base. This means necessarily working together with tech, and this means using a programming language that tech is comfortable with.

Then, this spills over. Usually, someone has the bright idea that the firm shouldn’t use two languages for what is essentially the same thing. And so the analytics people are also forced to use python for their analytics, even if it isn’t built for the purpose. And then it spreads.

Next is the “cool factor”. There is this impression that the more technical a solution is, the superior it is considered to be, even if it has no direct business impact (an employer had once told me, “I have raised money saying we are using machine learning. If our investors see the algorithms you’re proposing, they’ll want their money back”).

So a youngster getting into data wants to do “all the latest stuff”. This means machine learning. Deep learning. Reinforcement learning. And all that. There is an impression that this kind of work is “better work” compared to, let’s say, generating business insights using data. And in general, the packages for machine learning have traditionally been easier to use in Python than in R (though R is fast catching up, and in general Python is far behind R when it comes to user-friendliness).

Then, the growth in data and the jobs associated with it, such as machine learning or data engineering, has meant that a lot of formerly tech people have got into data work. Python is fundamentally a programming language, with a package (pandas) added on to do data work. Techies find it far more intuitive than R, which is fundamentally statistical software. On the other hand, people who are coming from a business / Excel background find it far more comfortable to use R. Python can be intimidating (I fall in this bucket).
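
For what it’s worth, here is a tiny made-up example of what “data work” in Python looks like – everything routes through the pandas add-on rather than the core language.

```python
# Tiny made-up example of "data work" in Python: the wrangling happens through
# the pandas add-on rather than the core language itself.
import pandas as pd

shipments = pd.DataFrame({
    "city":   ["Bangalore", "Bangalore", "Mumbai", "Mumbai", "Delhi"],
    "weight": [12.0, 7.5, 20.0, 3.2, 15.8],
    "late":   [False, True, False, True, False],
})

summary = (shipments
           .groupby("city", as_index=False)
           .agg(total_weight=("weight", "sum"), late_share=("late", "mean")))
print(summary)
```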

So yeah – the tech integration, the number of tech people who are coming into data and the “cool factor” associated with the more techie stuff means that Python is gaining, at R’s expense (in my circle at least).

In any case I’m going to continue to use R. I’m at least 10X faster in R than I am in Python, and having used R for 12 years now, I’m too used to that way of working to change things up.

Statistical analysis revisited – machine learning edition

Over ten years ago, I wrote this blog post that I had termed as a “lazy post” – it was an email that I’d written to a mailing list, which I’d then copied onto the blog. It was triggered by someone on the group making an off-hand comment of “doing regression analysis”, and I had set off on a rant about why the misuse of statistics was a massive problem.

Ten years on, I find the post to be quite relevant, except that instead of “statistics”, you just need to say “machine learning” or “data science”. So this is a truly lazy post, where I piggyback on my old post, to talk about the problems with indiscriminate use of data and models.

I had written:

there is this popular view that if there is data, then one ought to do statistical analysis, and draw conclusions from that, and make decisions based on these conclusions. unfortunately, in a large number of cases, the analysis ends up being done by someone who is not very proficient with statistics and who is basically applying formulae rather than using a concept. as long as you are using statistics as concepts, and not as formulae, I think you are fine. but you get into the “ok i see a time series here. let me put regression. never mind the significance levels or stationarity or any other such blah blah but i’ll take decisions based on my regression” then you are likely to get into trouble.
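
The classic illustration of that trap is a “spurious regression” (a made-up simulation, not from the original email): regress one random walk on another, completely independent one, and the slope comes out “significant” far more often than the 5% of the time you would expect.

```python
# Made-up illustration of the "never mind stationarity, let me put regression"
# trap: regress one random walk on another, completely independent one.
# The slope comes out "significant" far more often than the 5% you'd expect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, n_sims = 300, 200
significant = 0
for _ in range(n_sims):
    walk_a = np.cumsum(rng.normal(size=n))      # non-stationary, pure noise
    walk_b = np.cumsum(rng.normal(size=n))      # independent of walk_a
    fit = sm.OLS(walk_a, sm.add_constant(walk_b)).fit()
    significant += fit.pvalues[1] < 0.05

print(f"'significant' slopes: {significant}/{n_sims}")   # typically well over half
```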

The modern version of this is – everybody wants to do “big data” and “data science”. So if there is some data out there, people will want to draw insights from it. And since it is easy to apply machine learning models (thanks to open source toolkits such as the scikit-learn package in Python), people who don’t understand the models indiscriminately apply them to whatever data they have got. So you have people who don’t really understand data or machine learning working with both, and creating models that are dangerous.

As long as people have an idea of the models they are using, and the assumptions behind them, and the quality of data that goes into the models, we are fine. However, we are increasingly seeing cases of people using improper or biased data and applying models they don’t understand on top of it, with impacts that affect the wider world.

So the problem is not with “artificial intelligence” or “machine learning” or “big data” or “data science” or “statistics”. It is with the people who use them incorrectly.

 

I’m not a data scientist

After a little over four years of trying to ride a buzzword wave, I hereby formally cease to call myself a data scientist. There are some ongoing assignments where that term is used to refer to me, and that usage will continue, but going forward I’m not marketing myself as a “data scientist”, and will not use the phrase “data science” to describe my work.

The basic problem is that over time the term has come to mean something rather specific, and that doesn’t represent me and what I do at all. So why did I go through this long journey of calling myself a “data scientist”, trying to fit in in the “data science community” and now exiting?

It all started with a need to easily describe what I do.

To recall, my last proper full-time job was as a Quant at a leading investment bank, when I got this idea that rather than building obscure models for trading obscure corner cases, I might as well use my model-building skills to solve “real problems” in other industries which were back then not as well served by quants.

So I started calling myself a “Quant consultant”, except that nobody really knew what “quant” meant. I got variously described as a “technologist” and a “statistician” and “data monkey” and what not, none of which really captured what I was actually doing – using data and building models to help companies improve their businesses.

And then “data science” happened. I forget where I first came across this term, but I had been primed for it by reading Hal Varian saying that the “sexiest job in the next ten years will be statisticians”. I must mention that I had never come across the original post by DJ Patil and Thomas Davenport (that introduces the term) until I looked for it for my newsletter last year.

All I saw was “data” and “science”. I used data in my work, and I tried to bring science into the way my clients thought. And by 2014, Data Science had started becoming a thing. And I decided to ride the wave.

Now, data science has always been what artificial intelligence pioneer Marvin Minsky called a “suitcase term” – words or phrases that mean different things to different people (I heard about the concept first from this brilliant article on the “seven deadly sins of AI predictions”).

For some people, as long as some data is involved, and you do something remotely scientific it is data science. For others, it is about the use of sophisticated methods on data in order to extract insights. Some others conflate data science with statistics. For some others, only “machine learning” (another suitcase term!) is data science. And in the job market, “data scientist” can sometimes be interpreted as “glorified Python programmer”.

And right from inception, there were the data science jokes, like this one:

It is pertinent to put a whole list of them here.

“A Data Scientist is a Data Analyst who lives in California.”
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
“A data scientist is a business analyst who lives in New York.”
“A data scientist is a statistician who lives in San Francisco.”
“Data Science is statistics on a Mac.”

I loved these jokes, and thought I had found this term that rather accurately described me. Except that it didn’t.

The thing with suitcase terms is that they evolve over time, as they start getting used differently in different contexts. And so it was with data science. Over time, it has been used in a dominant fashion by people who mean it in the “machine learning” sense of the term. In fact, in most circles, the defining feature of a data scientist is the ability to write code in Python, and to use the scikit-learn package – neither of which is my distinguishing feature.

While this dissociation from the phrase “data science” has been coming for a long time (especially after my disastrous experience in the London job market in 2017), the final trigger, I guess, was a series of posts I wrote on LinkedIn in August/September this year.

The good thing about writing is that it helps you clarify your mind, and as I ranted about what I think data science should be, I realised over time that what I have in mind as “data science” is very different from what the broad market has in mind as “data science”. As per the market definition, just doing science with data isn’t data science any more – instead it is defined rather narrowly as a part of the software engineering stack where problems are solved based on building machine learning models that take data as input.

So it is prudent that I stop using the phrase “data science” and “data scientist” to describe myself and the work that I do.

PS: My newsletter will continue to be called “the art of data science”. The name gets “grandfathered” along with other ongoing assignments where I use the term “data science”.