analytics – Pertinent Observations

Recreating Tufte, and Bangalore weather

For most of my life, I pretty much haven’t understood what the point of “recreating” is. For example, in school if someone says they were going to “act out ______’s _____” I would wonder what the point of it was – that story is well known so they might as well do something more creative.

Later on in life, maybe some 12-13 years back, I discovered the joy in “retelling known stories” – since everyone knows the story you can be far more expressive in how you tell it. Still, however, just “re-creation” (not recreation) never really fascinated me. Most of the point of doing things is to do them your way, I’ve believed (and nowadays, if you think of it, most re-creating can be outsourced to a generative AI).

And the this weekend that changed. On Saturday, I made the long-pending trip to Blossom (helped that daughter had a birthday party to attend nearby), and among other things, I bought Edward Tufte’s classic “The Visual Display of Quantitative Information“. I had read a pirated PDF of this a decade ago (when I was starting out in “data science”), but always wanted the “real thing”.

And this physical copy, designed by Tufte himself, is an absolute joy to read. And I’m paying more attention to the (really beautiful) graphics. So, when I came across this chart of New York weather, I knew I had to recreate it.

A few months earlier, I had dowloaded the dataset for Bangalore’s hourly temperature and rainfall since 1981 (i.e. a bit longer than my own life). This dataset ended in November 2022, but I wasn’t concerned. Basically, this is such a large and complex dataset that so far I had been unable to come up with an easy way to visualise it. So, when I saw this thing from Tufte, recreating would be a good idea.

I spent about an hour and half yesterday doing this. I’ve ignored the colour schemes and other “aesthetic” stuff (just realised I’ve not included the right axis in my re-creation). But I do think I’ve got something fairly good.

My re-creation of Tufte’s New York weather map, in the context of Bangalore in 2022

2022 was an unusual weather year for Bangalore and it shows in this graph. May wasn’t as hot as usual, and there were some rather cold days. Bangalore recorded its coldest October and November days since the 90s (though as this graph shows, not a record by any means). It was overall a really wet year, constantly raining from May to November. The graph shows all of it.

Also if you look at the “noraml pattern” and the records, you see Bangalore’s unusual climate (yes, I do mean “climate” and not “weather” here). Thanks to the monsoons (and pre-monsoons), April is the hottest month. Summer, this year, has already started – in the afternoons it is impossible to go out now. The minimum temperatures are remarkably consistent through the year (so except early in the mornings, you pretty much NEVER need a sweater here – at least I haven’t after I moved back from London).

There is so much more I can do. I’m glad to have come across a template to analyse the data using. Whenever I get the enthu (you know what this website is called) I’ll upload my code to produce this graph onto github or something. And when I get more enthu, I’ll make it aesthetically similar to Tufte’s graph (and include December 2022 data as well).

A day at an award function

So I got an award today. It is called “exemplary data scientist”, and was given out by the Analytics India Magazine as part of their MachineCon 2022. I didn’t really do anything to get the award, apart from existing in my current job.

I guess having been out of the corporate world for nearly a decade, I had so far completely missed out on the awards and conferences circuit. I would see old classmates and colleagues put pictures on LinkedIn collecting awards. I wouldn’t know what to make of it when my oldest friend would tell me that whenever he heard “eye of the tiger”, he would mentally prepare to get up and go receive an award (he got so many I think). It was a world alien to me.

Parallelly, I used to crib about how while I’m well networked in India, and especially in Bangalore, my networking within the analytics and data science community is shit. In a way, I was longing for physical events to remedy this, and would lament that the pandemic had killed those.

So I was positively surprised when about a month ago Analytics India Magazine wrote to me saying they wanted to give me this award, and it would be part of this in-person conference. I knew of the magazine, so after asking around a bit on legitimacy of such awards and looking at who had got it the last time round, I happily accepted.

Most of the awardees were people like me – heads of analytics or data science at some company in India. And my hypothesis that my networking in the industry was shit was confirmed when I looked at the list of attendees – of 100 odd people listed on the MachineCon website, I barely knew 5 (of which 2 didn’t turn up at the event today).

Again I might sound like a n00b, but conferences like today are classic two sided markets (read this eminently readable paper on two sided markets and pricing of the same by Jean Tirole of the University of Toulouse). On the one hand are awardees – people like me and 99 others, who are incentivised to attend the event with the carrot of the award. On the other hand are people who want to meet us, who will then pay to attend the event (or sponsor it; the entry fee for paid tickets to the event was a hefty $399).

It is like “ladies’ night” that pubs have, where on a particular days of the week, women who go to the pub get a free drink. This attracts women, which in turn attracts men who seek to court the women. And what the pub spends in subsidising the women it makes back in terms of greater revenue from the men on the night.

And so it was at today’s conference. I got courted by at least 10 people, trying to sell me cloud services, “AI services on the cloud”, business intelligence tools, “AI powered business intelligence tools”, recruitment services and the like. Before the conference, I had received LinkedIn requests from a few people seeking to sell me stuff at the conference. In the middle of the conference, I got a call from an organiser asking me to step out of the hall so that a sponsor could sell to me.

I held a poker face with stock replies like “I’m not the person who makes this purchasing decision” or “I prefer open source tools” or “we’re building this in house”.

With full benefit of hindsight, Radisson Blu in Marathahalli is a pretty good conference venue. An entire wing of the ground floor of the hotel is dedicated for events, and the AIM guys had taken over the place. While I had not attended any such event earlier, it had all the markings of a well-funded and well-organised event.

As I entered the conference hall, the first thing that struck me was the number of people in suits. Most people were in suits (though few wore ties; And as if the conference expected people to turn up in suits, the goodie bag included a tie, a pair of cufflinks and a pocket square). And I’m just not used to that. Half the days I go to office in shorts. When I feel like wearing something more formal, I wear polo T-shirts with chinos.

My colleagues who went to the NSE last month to ring the bell to take us public all turned up company T-shirts and jeans. And that’s precisely what I wore to the conference today, though I had recently procured a “formal uniform” (polo T-shirt with company logo, rather than my “usual uniform” which is a round neck T-shirt). I was pretty much the only person there in “uniform”. Towards the end of the day, I saw one other guy in his company shirt, but he was wearing a blazer over it!

Pretty soon I met an old acquaintance (who I hadn’t known would be at the conference). He introduced me to a friend, and we went for coffee. I was eating a cookie with the coffee, and had an insight – at conferences, you should eat with your left hand. That way, you don’t touch the food with the same hand you use to touch other people’s hands (surprisingly I couldn’t find sanitiser dispensers at the venue).

The talks, as expected, were nothing much to write about. Most were by sponsors selling their wares. The one talk that wasn’t by a sponsor was delivered by a guy who was introduced as “his greatgrandfather did this. His grandfather did that. And now this guy is here to talk about ethics of AI”. Full Challenge Gopalakrishna feels happened (though, unfortunately, the Kannada fellows I’d hung out with earlier that day hadn’t watched the movie).

I was telling some people over lunch (which was pretty good) that talking about ethics in AI at a conference has become like worshipping Ganesha as part of any elaborate pooja. It has become the de riguer thing to do. And so you pay obeisance to the concept and move on.

Ok so i'm really really annoyed by all this focus on "ethical AI". It is just driven by "social desirability bias". Talking about ethical AI seemingly makes you look ethical so people talk about it. fund research in it. etc.

— Karthik S (@karthiks) March 27, 2022

The awards function had three sections. The first section was for “users of AI” (from what I understood). The second (where I was included) was for “exemplary data scientists”. I don’t know what the third was for (my wife is ill today so I came home early as soon as I’d collected my award), except that it would be given by fast bowler and match referee Javagal Srinath. Most of the people I’d hung out with through the day were in the Srinath section of the awards.

Overall it felt good. The drive to Marathahalli took only 45 minutes each way (I drove). A lot of people had travelled from other cities in India to reach the venue. I met a few new people. My networking in data science and analytics is still not great, but far better than it used to be. I hope to go for more such events (though we need to figure out how to do these events without that talks).

PS: Everyone who got the award in my section was made to line up for a group photo. As we posed with our awards, an organiser said “make sure all of you hold the prizes in a way that the Intel (today’s chief sponsor) logo faces the camera”. “I guess they want Intel outside”, I joked. It seemed to be well received by the people standing around me. I didn’t talk to any of them after that, though.

The “intel outside” pic. Courtesy: https://www.linkedin.com/company/analytics-india-magazine/posts/?feedView=all

The Science in Data Science

The science in “data science” basically represents the “scientific method”.

It’s a decade since the phrase “data scientist” got coined, though if you go on LinkedIn, you will find people who claim to have more than two years of experience in the subject.

The origins of the phrase itself are unclear, though some sources claim that it came out of this HBR article in 2012 written by Thomas Davenport and DJ Patil (though, in 2009, Hal Varian, formerly Google’s Chief Economist had said that the “sexiest job of the 21st century” will be that of a statistician).

Some of you might recall that in 2018, I had said that “I’m not a data scientist any more“. That was mostly down to my experience working with companies in London, where I found that data science was used as a euphemism for “machine learning” – something I was incredibly uncomfortable with.

With the benefit of hindsight, it seems like I was wrong. My view on data science being a euphemism for machine learning came from interacting with small samples of people (though it could be an English quirk). As I’ve dug around over the years, it seems like the “science” in data science comes not from the maths in machine learning, but elsewhere.

One phenomenon that had always intrigued me was the number of people with PhDs, especially NOT in maths, computer science of statistics, who have made a career in data science. Initially I dismissed it down to “the gap between PhD and tenure track faculty positions in science”. However, the numbers kept growing.

The more perceptive of you might know that I run a podcast now. It is called “Data Chatter“, and is ten episodes old now. The basic aim of the podcast is for me to have some interesting conversations – and then release them for public benefit. Yeah, yeah.

So, there was this thing that intrigued me, and I have a podcast. I did what you would have expected me to do – get on a guest who went from a science background to data science. I got Dhanya, my classmate from school, to talk about how her background with a PhD in neuroscience has helped her become a better data scientist.

It is a fascinating conversation, and served its primary purpose of making me understand what the “science” in data science really is. I had gone into the conversation expecting to talk about some machine learning, and how that gets used in academia or whatever. Instead, we spoke for an hour about designing experiments, collecting data and testing hypotheses.

The science in “data science” basically represents the “scientific method“. What Dhanya told me (you should listen to the conversation) is that a PhD prepares you for thinking in the scientific method, and drills into you years of practice in it. And this is especially true of “experimental” PhDs.

And then, last night, while preparing the notes for the podcast release, I stumbled upon the original HBR article by Thomas Davenport and DJ Patil talking about “data science”. And I found that they talk about the scientific method as well. And I found that I had talked about it in my newsletter as well – only to forget it later. This is what I had written:

Reading Patil and Davenport’s article carefully suggests, however, that companies might be making a deliberate attempt at recruiting pure science PhDs for data scientist roles.

The following excerpts from the article (which possibly shaped the way many organisations think about data science) can help us understand why PhDs are sought after as data scientists.

Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time (Ed: the article was published in late 2012, so we’re almost “five years later” now)

Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results.

Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology.

It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path

Patil and Davenport make it very clear that traditional “data analysts” may not make for great data scientists.

We learn, and we forget, and we re-learn. But learning is precisely what the scientific method, which underpins the “science” in data science, is all about. And it is definitely NOT about machine learning.

Podcast: All Reals

I had spoken here a few times about starting a new “data podcast, right? The first episode is out today, and in this I speak to S Anand, cofounder and CEO of Gramener, about the interface of business with data science.

It’s a long freewheeling conversation, where we talk about data science in general, about Excel, about data visualisations, pie charts, Tufte and all that.

Do listen – it should be available on all podcast platforms, and let me know what you think. Oh, and don’t forget to subscribe to the podcast. New episodes will be out every Tuesday morning.

And if you think you want to be on the podcast, or know someone who wants to be a guest on the podcast, you can reach out. datachatterpodcast AT gmail.

Launching: Data Chatter

A few weeks back I had mentioned here that I’m starting a podcast. And it is now ready for release. Listen to the trailer here:

It is a series of conversations about all things data. First episode will be out on Tuesday, and then weekly after that. I’ve already built up an inventory of seven episodes. So far I’ve recorded episodes about big data, business intelligence, visualisations, a lot of “domain-specific” analytics, and the history of analytics in India. And many more are to come.

Subscribe to the podcast to be able to listen to it whenever it comes out. It is available on all podcasting platforms. For some reason, Apple is not listed on the anchor site, but if you search for “Data Chatter” on Apple Podcasts, you should find it (I did).

And of course, feedback is welcome (you can just comment on this post). And please share this podcast with whoever else you think might like it.

Should this have been my SOP?

I was chatting with a friend yesterday about analytics and “data science” and machine learning and data engineering and all that, and he commented that in his opinion a lot of the work mostly involves gathering and cleaning the data, and that any “analytics” is mostly around averaging and the sort.

This reminded me of an old newsletter I’d written way back in January 2018, soon after I’d read Raphael Honigstein‘s Das Reboot. A short discussion ensued. I sent him the link to that newsletter. And having read the bit about Das Reboot (I was talking about how SAP had helped the German national team win the 2014 FIFA World Cup) and the subsequent section of the newsletter, my friend remarked that I could have used that newsletter edition as a “statement of purpose for my job hunt”.

Now that my job hunt is done, and I’m no more in the job market, I don’t need an SOP. However, for the purpose that I don’t forget this, and keep in mind the next time I’m applying for a job, I’m reproducing a part of that newsletter here. Even if you subscribed to that newsletter, I recommend that you read it again. It’s been a long time, and this is still relevant.

Das Reboot

This is not normally the kind of book you’d see being recommended in a Data Science newsletter, but I found enough in Raphael Honigstein’s book on the German football renaissance in the last 10 years for it to merit a mention here.

So the story goes that prior to the 2014 edition of the Indian Premier League (cricket), Kolkata Knight Riders had announced a partnership with tech giant SAP, and claimed that they would use “big data insights” from SAP’s HANA system to power their analytics. Back then, I’d scoffed, since I wasn’t sure if the amount of data that’s generated in all cricket matches till then wasn’t big enough to merit “big data analytics”.

As it happens, the Knight Riders duly won that edition of the IPL. Perhaps coincidentally, SAP entered into a partnership with another champion team that year – the German national men’s football team, and Honigstein dedicates a chapter of his book to this, and other, partnerships, and the role of analytics in helping the team’s victory in that year’s World Cup.

If you look past all the marketing spiel (“HANA”, “big data”, etc.) what SAP did was to group data, generate insights and present it to the players in an easily consumable format. So in the football case, they developed an app for players where they could see videos of specific opponents doing things. It made it easy for players to review certain kinds of their own mistakes. And so on. Nothing particularly fancy; simply simple data put together in a nice easy-to-consume format.

A couple of money quotes from the book. One on what makes for good analytics systems:

‘It’s not particularly clever,’ says McCormick, ‘but its ease of use made it an effective tool. We didn’t want to bombard coaches or players with numbers. We wanted them to be able to see, literally, whether the data supported their gut feelings and intuition. It was designed to add value for a coach or athlete who isn’t that interested in analytics otherwise. Big data needed to be turned into KPIs that made sense to non-analysts.’

And this one on how good analytics can sometimes invert hierarchies, and empower the people on the front to make their own good decisions rather than always depend on direction from the top:

In its user-friendliness, the technology reversed the traditional top-down flow of tactical information in a football team. Players would pass on their findings to Flick and Löw. Lahm and Mertesacker were also allowed to have some input into Siegenthaler’s and Clemens’ official pre-match briefing, bringing the players’ perspective – and a sense of what was truly relevant on the pitch – to the table.

A lot of business analytics is just about this – presenting the existing data in an easily consumable format. There might be some statistics or machine learning involved somewhere, but ultimately it’s about empowering the analysts and managers with the right kind of data and tools. And what SAP’s experience tells us is that it may not be that bad a thing to tack on some nice marketing on top!

Hiring data scientists

I normally don’t click through on articles in my LinkedIn feed, but this article about the churn in senior data scientists caught my eye enough for me to click through and read the whole thing. I must admit to some degree of confirmation bias – the article reflected my thoughts a fair bit.

Given this confirmation bias, I’ll spare you my commentary and simply put in a few quotes:

Many large companies have fallen into the trap that you need a PhD to do data science, you don’t.

Not to mention, I have yet to see a data science program I would personally endorse. It’s run by people who have never done the job of data science outside of a lab. That’s not what you want for your company.

Doing data science and managing data science are not the same. Just like being an engineer and a product manager are not the same. There is a lot of overlap but overlap does not equal sameness.

Most data scientists are just not ready to lead the teams. This is why the failure rate of data science teams is over 90% right now. Often companies put a strong technical person in charge when they really need a strong business person in charge. I call it a data strategist.

I have worked with companies that demand agile and scrum for data science and then see half their team walk in less than a year. You can’t tell a team they will solve a problem in two sprints. If they don’t’ have the data or tools it won’t happen.

I’ll end this blog post with what my friend had to say (yesterday) about what I’d written about how SAP helped the German National team. “This is what everyone needs to do first. (All that digital transformation everyone is working on should be this kind of work)”.

I agree with him on this.

Record of my publicly available work

A few people who I’ve spoken to as part of my job hunt have asked to see some “detailed descriptions” of work that I’ve done. The other day, I put together an email with some of these descriptions. I thought it might make sense to “document” it in one place (and for me, the “obvious one place” is this blog). So here it is. As you might notice, this takes the form of an email.

I’m putting together links to some of the publicly available work that i’ve done.

1. Cricket

I have a model to evaluate and “tell the story of a cricket match”. This works for all limited overs games, and is based on a dynamic programming algorithm similar to the WASP. The basic idea is to estimate the odds of each team winning at the end of each ball, and then chart that out to come up with a “match story”.

And through some simple rules-based intelligence, the key periods in the game are marked out.

The model can also be used to evaluate the contributions of individual batsmen and bowlers towards their teams’ cause, and when aggregated across games and seasons, can be used to evaluate players’ overall contributions.

Here is a video where I explain the model and how to interpret it:

The algorithm runs live during a game. You can evaluate the latest T20 game here:

Here is a more interactive version , including a larger selection of matches going back in time.

Related to this is a cricket analytics newsletter I actively wrote during the World Cup last year. Most Indians might find this post from the newsletter interesting:

2. Covid-19

At the beginning of the pandemic (when we had just gone under a national lockdown), I had built a few agent based models to evaluate the risk associated with different kinds of commercial activities. They are described here.

Every morning, a script that I have written parses the day’s data from covid19india.org and puts out some graphs to my twitter account This is a daily fully automated feature.

Here is another agent based model that I had built to model the impact of social distancing on covid-19.

A tweetstorm based on Bayes Theorem that I wrote during the pandemic went viral enough that I got invited to a prime time news show (I didn’t go).

3. Visualisations

I used to collect bad visualisations.

I also briefly wrote a newsletter analysing “good and bad visualisations”.

4. I have an “app” to predict which single malts you might like based on your existing likes. This blogpost explains the process behind (a predecessor of ) this model.

5. I had some fun with machine learning, using different techniques to see how they perform in terms of predicting different kinds of simple patterns.

6. I used to write a newsletter on “the art of data science”.

In addition to this, you can find my articles for Mint here. Also, this page on my website as links to some anonymised case studies.

I guess that’s a lot? In any case, now I’m wondering if I did the right thing by choosing “skthewimp” as my Github username.

Scrabble

I’ve forgotten which stage of lockdown or “unlock” e-commerce for “non-essential goods” reopened, but among the first things we ordered was a Scrabble board. It was an impulse decision. We were on Amazon ordering puzzles for the daughter, and she had just about started putting together “sounds” to make words, so we thought “scrabble tiles might be useful for her to make words with”.

The thing duly arrived two or three days later. The wife had never played Scrabble before, so on the day it arrived I taught her the rules of the game. We play with the Sowpods dictionary open, so we can check words that hte opponent challenges. Our “scrabble vocabulary” has surely improved since the time we started playing (“Qi” is a lifesaver, btw).

I had insisted on ordering the “official Scrabble board” sold by Mattel. The board is excellent. The tiles are excellent. The bag in which the tiles are stored is also excellent. The only problem is that there was no “scoreboard” that arrived in the set.

On the first day we played (when I taught the wife the rules, and she ended up beating me – I’m so horrible at the game), we used a piece of paper to maintain scores. The next day, we decided to score using an Excel sheet. Since then, we’ve continued to use Excel. The scoring format looks somewhat like this.

So each worksheet contains a single day’s play. Initially after we got the board, we played pretty much every day. Sometimes multiple times a day (you might notice that we played 4 games on 3rd June). So far, we’ve played 31 games. I’ve won 19, Priyanka has won 11 and one ended in a tie.

In any case, scoring on Excel has provided an additional advantage – analytics!! I have an R script that I run after every game, that parses the Excel sheet and does some basic analytics on how we play.

For example, on each turn, I make an average of 16.8 points, while Priyanka makes 14.6. Our score distribution makes for interesting viewing. Basically, she follows a “long tail strategy”. Most of the time, she is content with making simple words, but occasionally she produces a blockbuster.

I won’t put a graph here – it’s not clear enough. This table shows how many times we’ve each made more than a particular threshold (in a single turn). The figures are cumulative

Threshold	Karthik	Priyanka
30	50	44
40	12	17
50	5	10
60	3	5
70	2	2
80	0	1
90	0	1
100	0	1

Notice that while I’ve made many more 30+ scores than her, she’s made many more 40+ scores than me. Beyond that, she has crossed every threshold at least as many times as me.

Another piece of analysis is the “score multiple”. This is a measure of “how well we use our letters”. For example, if I start place the word “tiger” on a double word score (and no double or triple letter score), I get 12 points. The points total on the tiles is 6, giving me a multiple of 2.

Over the games I have found that I have a multiple of 1.75, while she has a multiple of 1.70. So I “utilise” the tiles that I have (and the ones on the board) a wee bit “better” than her, though she often accuses me of “over optimising”.

It’s been fun so far. There was a period of time when we were addicted to the game, and we still turn to it when one of us is in a “work rut”. And thanks to maintaining scores on Excel, the analytics after is also fun.

I’m pretty sure you’re spending the lockdown playing some board game as well. I strongly urge you to use Excel (or equivalent) to maintain scores. The analytics provides a very strong collateral benefit.

This year on Spotify

I’m rather disappointed with my end-of-year Spotify report this year. I mean, I know it’s automated analytics, and no human has really verified it, etc. but there are some basics that the algorithm failed to cover.

The first few slides of my “annual report” told me that my listening changed by seasons. That in January to March, my favourite artists were Black Sabbath and Pink Floyd, and from April to June they were Becky Hill and Meduza. And that from July onwards it was Sigala.

Now, there was a life-changing event that happened in late March which Spotify knows about, but failed to acknowledge in the report – I moved from the UK to India. And in India, Spotify’s inventory is far smaller than it is in the UK. So some of the bands I used to listen to heavily in the UK, like Black Sabbath, went off my playlist in India. My daughter’s lullaby playlist, which is the most consumed music for me, moved from Spotify to Amazon Music (and more recently to Apple Music).

The other thing with my Spotify use-case is that it’s not just me who listens to it. I share the account with my wife and daughter, and while I know that Spotify has an algorithm for filtering out kid stuff, I’m surprised it didn’t figure out that two people are sharing this account (and pitched us a family subscription).

According to the report, these are the most listened to genres in 2019:

Now there are two clear classes of genres here. I’m surprised that Spotify failed to pick it out. Moreover, the devices associated with my account that play Rock or Power Metal are disjoint from the devices that play Pop, EDM or House. It’s almost like Spotify didn’t want to admit that people share accounts.

Then some three slides on my podcast listening for the year, when I’ve overall listened to five hours of podcasts using Spotify. If I, a human, were building this report, I would have dropped this section citing insufficient data, rather than wasting three slides with analytics that simply don’t make sense.

I see the importance of this segment in Spotify’s report, since they want to focus more on podcasts (being an “audio company” rather than a “music company”), but maybe something in the report to encourage me to use Spotify for more podcasts (maybe recommending Spotify’s exclusive podcasts that I might like, be it based on limited data?) might have helped.

Finally, take a look at my our most played songs in 2019.

It looks like my daughter’s sleeping playlist threaded with my wife’s favourite songs (after a point the latter dominate). “My songs” are nowhere to be found – I have to go all the way down to number 23 to find Judas Priest’s cover of Diamonds and Rust. I mean I know I’ve been diversifying the kind of music that I listen to, while my wife listens to pretty much the same stuff over and over again!

In any case, automated analytics is all fine, but there are some not-so-edge cases where the reports that it generates is obviously bad. Hopefully the people at Spotify will figure this out and use more intelligence in producing next year’s report!

EPL: Mid-Season Review

Going into the November international break, Liverpool are eight points ahead at the top of the Premier League. Defending champions Manchester City have slipped to fourth place following their loss to Liverpool. The question most commentators are asking is if Liverpool can hold on to this lead.

We are two-thirds of the way through the first round robin of the premier league. The thing with evaluating league standings midway through the round robin is that it doesn’t account for the fixture list. For example, Liverpool have finished playing the rest of the “big six” (or seven, if you include Leicester), but Manchester City have many games to go among the top teams.

So my practice over the years has been to compare team performance to corresponding fixtures in the previous season, and to look at the points difference. Then, assuming the rest of the season goes just like last year, we can project who is likely to end up where.

Now, relegation and promotion introduces a source of complication, but we can “solve” that by replacing last season’s relegated teams with this season’s promoted teams (18th by Championship winners, 19th by Championship runners-up, and 20th by Championship playoff winners).

It’s not the first time I’m doing this analysis. I’d done it once in 2013-14, and once in 2014-15. You will notice that the graphs look similar as well – that’s how lazy I am.

Anyways, this is the points differential thus far compared to corresponding fixtures of last season.

Leicester are the most improved team from last season, having scored 8 points more than in corresponding fixtures from last season. Sheffield United, albeit starting from a low base, have done extremely well as well. And last season’s runners-up Liverpool are on a plus 6.

The team that has done worst relative to last season is Tottenham Hotspur, at minus 13. Key players entering the final years of their contract and not signing extensions, and scanty recruitment over the last 2-3 years, haven’t helped. And then there is Manchester City at minus 9!

So assuming the rest of the season’s fixtures go according to last season’s corresponding fixtures, what will the final table look like at the end of the season?
We see that if Liverpool replicate their results from last season for the rest of the fixtures, they should win the league comfortably.

What is more interesting is the gaps between 1-2, 2-3 and 3-4. Each of the top three positions is likely to be decided “comfortably”, with a fairly congested mid-table.

As mentioned earlier, this kind of analysis is unfair to the promoted teams. It is highly unlikely that Sheffield will get relegated based on the start they’ve had.

We’ll repeat this analysis after a couple of months to see where the league stands!