Quiz Time

This morning was Mahaquizzer, KQA’s (formerly annual) national solo written quiz championship. When I saw the invite a few days back, I somehow registered in my head that the quiz was between 11:00 am and 12:30 pm.

The website says,

Reporting time for participants will be 10:00 AM

The quiz will be held across all cities from 10:30 am – 12:00 pm

But in my head I had it as 10:30 registration and the quiz starting at 11:00. And so around 10, I went in for a long shower. I came out and, just for the heck of it, picked up my phone to confirm what time the event was. And panicked.

It was 10:55 by the time I reached the venue and started doing the quiz. This meant that rather than the allotted 90 minutes, I only had 65 minutes to answer the 150 questions. I got into my “speed zone” (I used to be good at solving problems really really fast – that’s how I did very well in CAT etc.) and started working my way through the paper.

It was ~11:55 by the time I got done with my first pass of the paper, which meant there was little time for me to revise. And so for a lot of questions I ended up paying far less attention than I should have. And I left some 5-10 questions unattempted (more because I didn’t have a good answer than due to lack of time).

When the answers were given out later, I figured I had scored 67 out of 150. There were a few bad misses. My intuitive thought then was that had I had the full 90 minutes, I could have done better on some 5-10 questions and maybe ended up with 75 out of 150. My misreading of the time had cost me 5-10 points (and I’ll know in a few days how many places in the national ranking).

Thinking about this, I headed out for lunch with three other quizzers (all of whom scored much more than me this morning). Through the lunch, we discussed all the questions. It turned out that for a bunch of questions, some of these people had over-analysed and over-thought, and ended up getting the wrong answer. Because I was doing the quiz in some insane speed mode, I didn’t have the luxury to over-analyse – I had written down the simplest and most intuitive answers I could think of.

Suddenly, by the end of the lunch (by which time we had analysed the full paper), I wasn’t sure any more about how much more I would have scored had I had more time. Yes, there were 5-10 questions that would definitely have benefited from my paying more attention. On the other hand, there was another bunch of questions where more attention might actually have been damaging – I would have ended up over-analysing and turned my correct answers into wrong ones.

So I will never really know how much more (or less) I might have got had I had the full quota of time this morning.

And now that I think of it – it is sometimes the case with my blog posts as well. Most of the time I just want to bang one out and publish it, so I get into the zone and start writing. What goes on to this page is a stream of thought, and that is what you end up reading.

However, when I try to write more leisurely, I make a right royal mess of it. I over-analyse, over-edit, spend needless time worrying about things I shouldn’t be worrying about, etc. In my own opinion, the best blogposts are those I have written in a “mad speed zone”. Editing can only make my writing worse.

PS: Because I was quizzing today (in the afternoon I attended Asiasweep along with Kodhi – doing a quiz with Kodhi is always a lot of fun because we end up laughing about random things through the quiz), I deliberately decided to skip my ADHD medication for the day. And that worked out well, since I was able to make all sorts of random connections and work out the answers.

In quizzing, a little bit of hallucination can be a good thing!

Covid-19 superspreaders in Karnataka

Through a combination of luck and competence, my home state of Karnataka has handled the Covid-19 crisis rather well. While the total number of cases detected in the state edged past 2000 recently, the number of locally transmitted cases detected each day has hovered in the 20-25 range.

Perhaps because of the low case volume, Karnataka is able to give out data at a level of detail that few other states in India are providing. For each case, the rationale behind why the patient was tested (which usually indicates the source from which they caught the disease) is given. This data comes out in two daily updates through the @dhfwka twitter handle.

There was some research that came out recently showing that the spread of covid-19 follows a classic power law, with a low value of “alpha”. Basically, most infected people don’t infect anyone else, but a handful of infected people infect lots of others.

The Karnataka data, put out by @dhfwka and meticulously collected and organised by the folks at covid19india.org (they frequently drive me mad by suddenly changing the API or moving data into a new file, but overall they’ve been doing stellar work), has sufficient information to see if this sort of power law holds.

For every patient who was tested thanks to being a contact of an already infected patient, the “notes” field of the data contains the ID of that already infected patient. This way, we are able to build a sort of graph of who got the disease from whom (some people got the disease “from a containment zone”, or out of state, and they are all ignored in this analysis).

From this graph, we can approximate how many people each infected person transmitted the infection to. Here are the “top” people in Karnataka who transmitted the disease to the most people.

Patient 653, a 34-year-old male from Karnataka who got infected by patient 420, passed on the disease to 45 others. Patient 419 passed it on to 34 others. And so on.
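In case you want to reproduce this counting, here is roughly how it goes. A minimal sketch only: the API URL, the field names (“detectedstate”, “notes”) and the “P420”-style patient ID convention in the notes are my assumptions about how the covid19india.org raw data was laid out at the time, so treat this as illustrative rather than exact.

```python
import json
import re
from collections import Counter
from urllib.request import urlopen

# Assumed URL; covid19india.org changed its API and file layout from time to time
RAW_URL = "https://api.covid19india.org/raw_data.json"
raw = json.loads(urlopen(RAW_URL).read())["raw_data"]

spread_count = Counter()
for row in raw:
    if row.get("detectedstate") != "Karnataka":
        continue
    # Notes like "Contact of P420" name the patient who passed on the infection;
    # cases traced to containment zones or travel carry no such ID and are skipped
    match = re.search(r"P(\d+)", row.get("notes", ""))
    if match:
        spread_count[int(match.group(1))] += 1

# The "top" spreaders: patients ordered by number of onward infections
for patient_id, n in spread_count.most_common(10):
    print(f"Patient {patient_id} passed the disease on to {n} others")
```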

Overall in Karnataka, based on the data from covid19india.org as of tonight, there have been 732 cases where the source (person) of infection has been clearly identified. These 732 cases were transmitted by 205 people. Just two of the 205 (less than 1%) are responsible for 79 people (11% of all cases where the transmitter has been identified) getting infected.

The top 10 “spreaders” in Karnataka are responsible for infecting 260 people, or 36% of all cases where transmission is known. The top 20 spreaders in the state (10% of all spreaders) are responsible for 48% of all cases. The top 41 spreaders (20% of all spreaders) are responsible for 61% of all transmitted cases.
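These shares are one loop over the sorted counts, continuing from the sketch above:

```python
# Continuing from the sketch above: share of transmitted cases attributable
# to the top k spreaders
counts = sorted(spread_count.values(), reverse=True)
total = sum(counts)  # all cases with a clearly identified source (732 here)
for k in (2, 10, 20, 41):
    print(f"Top {k} spreaders: {sum(counts[:k]) / total:.0%} of transmitted cases")
```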

Now you might think this is not as steep as the “well-known” Pareto distribution (the 80-20 distribution), except that the denominator here includes only people who are known to have transmitted the disease to someone. Our analysis ignores the 1000-odd people who were found to have the disease at least one week ago and none of whose contacts have been found to have the disease; include them in the denominator and the curve becomes far steeper than 80-20.

I admit this graph is a little difficult to understand, but basically I’ve ordered the people found positive for covid-19 in Karnataka by the number of people they’ve passed the infection on to, and graphed how many people they’ve cumulatively infected. It is a very clear Pareto curve.

The exact exponent of the power law depends on what you take as the denominator (number of people who could have infected others, having themselves been infected), but the shape of the curve is not in question.
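For what it’s worth, here is one standard way to put a number on that exponent: a continuous-approximation maximum-likelihood (Hill-style) estimator applied to the per-patient transmission counts. This is my own back-of-envelope sketch, not the method used in the research cited above, and the estimate moves around depending on the xmin you pick and on how you treat the many patients with zero onward transmissions (who cannot enter a power-law fit directly).

```python
import math

def power_law_alpha(counts, xmin=1):
    """Continuous-approximation MLE for the power-law exponent:
    alpha = 1 + n / sum(ln(x / xmin)), over counts >= xmin."""
    xs = [c for c in counts if c >= xmin]
    return 1 + len(xs) / sum(math.log(c / xmin) for c in xs)

# Continuing from the earlier sketch: fit over everyone known to have
# infected at least one other person
print(power_law_alpha(spread_count.values()))
```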

Essentially the Karnataka data validates the research that has recently come out – most of the disease spread stems from a handful of superspreaders. A very large proportion of people who are infected don’t pass it on to any of their contacts.

Anscombe’s Quartet and Analytics

Many “analytics professionals” or “quants” I know or have worked with have no hesitation in diving straight into a statistical model when they are faced with a problem, rather than trying to understand the data first. That is not the way I work. Whenever I set out to solve a new problem, I start by spending considerable time trying to get a feel for the data. There are many things I do to “feel” the data – looking at a few lines of it, looking at descriptive statistics of some of the variables, looking at the distributions of individual variables. The most powerful tool that lets me get a feel for the data, however, is the humble scatterplot.

The beauty of the scatter plot is that it allows you to get a real feel for the data. Taking variables two at a time, it shows you not only how each of them is distributed but also how they are related to each other. Relationships that are not apparent when you look at the raw numbers become apparent when you graph them. It would not be wrong to say that the scatterplot defines the direction and scope of your entire solution.
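As a sketch of this workflow (the file name `your_data.csv` is a hypothetical stand-in for whatever you are about to model), pandas will draw every pairwise scatter plot in a single call:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("your_data.csv")  # hypothetical input; use your own data set
# One scatter plot per pair of variables, with histograms on the diagonal
pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```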

The problem with the debate on how analytics should be done is that it is loaded: a large majority of people who use statistics in their daily work dive straight into analysis without looking at the data. Perhaps they deem looking at data a waste of time? I have even seen pitch decks by extremely reputed software companies that propose solutions such as “we will solve this problem using Logistic Regression” without having seen the data at all.

Let us now take an example. Consider the following four data sets (my apologies for putting an image here):

Source: http://lectorisalutem.wordpress.com/2011/11/17/anscombes-quartet/

Let us say you dive straight into the analysis. Like a good “analytics professional”, you dive straight into regression. You may even do some descriptive statistics for each of the data sets along the way. And this is what you find (again, apologies for the image):

Source: http://lectorisalutem.wordpress.com/2011/11/17/anscombes-quartet/
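You need not take the image at face value; this is easy to verify yourself. A minimal sketch, using the copy of Anscombe’s Quartet that ships with seaborn’s example data sets (note that `load_dataset` fetches the data over the network from seaborn’s example-data repository):

```python
import seaborn as sns
from scipy import stats

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    fit = stats.linregress(group.x, group.y)
    # Means, variances and regression fits come out nearly identical
    # across all four data sets
    print(f"{name}: mean(x)={group.x.mean():.2f} var(x)={group.x.var():.2f} "
          f"mean(y)={group.y.mean():.2f} var(y)={group.y.var():.2f} "
          f"slope={fit.slope:.2f} intercept={fit.intercept:.2f} "
          f"R^2={fit.rvalue**2:.2f}")
```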

Do you conclude that the four data sets are the same? Pretty much identical statistics, right? I wouldn’t be surprised if you were to publish that there is nothing to differentiate these four data sets. Now, let us do a simple scatter plot of each of these data sets and check for ourselves:

Source: http://lectorisalutem.wordpress.com/2011/11/17/anscombes-quartet/
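Again, easy to reproduce yourself – a sketch using the same seaborn copy of the data:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")
# One scatter plot (with its fitted regression line) per data set
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```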

Now, do you still think these data sets are identical? And do you now see why I stress so much upon getting a feel for the data and drawing the humble scatter plot?

The data set I’ve used here is a rather famous one, called Anscombe’s Quartet. It was constructed precisely to make the point of this post: that one needs to get a feel for the data before diving into the analysis. Draw scatter plots for every pair of variables. Understand the relationships, and let this understanding guide your further analysis. If every piece of data could be perfectly analysed by diving straight into a regression, the job of analytics might as well be outsourced to computers.

PS: It is a tragedy that when they teach visualization in school they don’t even mention the scatter plot. At a recent workshop I asked the participants to name the different kinds of graphs they knew. “Line”, “bar” and “pie” were the most common answers. Not one answered “scatter plot”. Given the utility of this simple plot, this is indeed tragic.