sampling – Pertinent Observations

Covid-19 Prevalence in Karnataka

Finally, many months after other Indian states had conducted a similar exercise, Karnataka released the results of its first “covid-19 sero survey” earlier this week. The headline number being put out is that about 27% of the state has already suffered from the infection, and has antibodies to show for it. From the press release:

Out of 7.07 crore estimated population in Karnataka, the study estimates that 1.93 crore (27.3%) of the people are either currently infected or already had the infection in the past, as of 16 September 2020.

To put that number in context, as of 16th September, there were a total of 485,000 confirmed cases in Karnataka (official statistics via covid19india.org), and 7536 people had died of the disease in the state.

It had long been estimated that official numbers of covid-19 cases are off by a factor of 10 or 20 – that the number of people who have actually got the disease is 10 to 20 times the official number. The serosurvey, assuming it has been done properly, suggests that the factor (as of September) is 40!

If the ratio has continued to hold (and the survey is accurate), nearly one in two people in Karnataka have already got the disease! (As of today, there are 839,000 known cases in Karnataka.)
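The arithmetic behind the "factor of 40" and the "one in two" figure can be checked in a few lines. This is only a sketch using the numbers quoted above (1 crore = 10 million); the variable names are my own, and the last step rests on the assumption that the September ratio of actual to confirmed cases still holds today.

```python
CRORE = 10_000_000  # 1 crore = 10 million

population = 7.07 * CRORE          # estimated population, from the press release
estimated_infected = 1.93 * CRORE  # survey estimate as of 16 September 2020
confirmed_sept = 485_000           # confirmed cases on 16 September
confirmed_now = 839_000            # confirmed cases at the time of writing

# Ratio of estimated infections to confirmed cases in mid-September
undercount_factor = estimated_infected / confirmed_sept

# ASSUMPTION: the same ratio applies to today's confirmed cases
implied_infected_now = undercount_factor * confirmed_now
implied_share_now = implied_infected_now / population

print(f"undercount factor: {undercount_factor:.1f}")
print(f"implied share infected now: {implied_share_now:.0%}")
```

The factor comes out just under 40, and the implied share a little under one half, which is where "nearly one in two" comes from.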

Of course, there are regional variations, though I should mention that the smaller the region you take, the less accurate the survey will be (smaller sample size and all that). In Bangalore Urban, for example, the survey estimates that 30% of the population had been infected by mid-September. If the ratio holds, we see that nearly 60% of the population in the city has already got the disease.

The official statistics (separate from the survey) also suggest that the disease has peaked in Karnataka. In fact, it seems to have peaked right around the time the survey was being conducted, in September. In September, it was common to see 7000-10000 new cases confirmed in Karnataka each day. That number has come down to about 3000 per day now.

Now, there are a few questions we need to answer. Firstly – is this factor of 40 (actual cases to known cases) feasible? Based on this data point, it makes sense:

The number of active cases in Karnataka has increased from 491 on 14 May to 811 on 19 May.

Most of these recent cases are from among the people who are returning to the state from other states & countries.

93% of the currently active cases are asymptomatic. pic.twitter.com/TjTEZdlK1w

– L K Atheeq (@lkatheeq) May 19, 2020

In May, when Karnataka had a very small number of “native cases” and was aggressively testing everyone who had returned to the state from elsewhere, a staggering 93% of currently active cases were asymptomatic. In other words, only about 1 in 14 people who were infected showed any symptoms.

Then, as I might have remarked on Twitter a few times, compulsory quarantining or hospitalisation (which was in force until July IIRC) has strongly discouraged people from seeking medical help or getting tested. This has meant that people get themselves tested only when the symptoms are really clear, or when they need attention. The downside of this, of course, has been that many people have got themselves tested too late for help. One statistic I remember is that about 33% of people who died of covid-19 in hospitals died within 24 hours of hospitalisation.

So if only one in 14 show any symptoms, and only those with relatively serious symptoms (or with close relatives who have serious symptoms) get themselves tested, this undercount by a factor of 40 can make sense.

Then – does the survey make sense? Is 15000 samples big enough for a state of 70 million? For starters, the population of the state doesn’t matter: rudimentary statistics (I always go to this presentation by Rajeeva Karandikar of CMI) tells us that, as long as the sample has been chosen randomly, all that matters for the accuracy of the survey is the size of the sample. And for a binary outcome (infected / not), 15000 is good enough as long as the sample has been random.
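To see why the population size drops out, here is a rough sketch of the textbook 95% margin of error for a proportion, with and without the finite population correction. It assumes a simple random sample, which is not how this survey was designed, so treat it as an illustration of the principle rather than the survey's actual error. The function name and the 7 million comparison population are my own.

```python
import math

def margin_of_error(p, n, population=None, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of size n (normal approximation)."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction; very close to 1 when n << population
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

p = 0.273   # headline prevalence from the survey
n = 15_000  # approximate sample size

moe_infinite = margin_of_error(p, n)
moe_karnataka = margin_of_error(p, n, population=70_700_000)
moe_small_state = margin_of_error(p, n, population=7_000_000)  # hypothetical

for label, moe in [("no correction", moe_infinite),
                   ("70.7 million", moe_karnataka),
                   ("7 million", moe_small_state)]:
    print(f"{label:>14}: +/- {moe * 100:.2f} percentage points")
```

All three cases come out at well under one percentage point, and a state one tenth the size of Karnataka would need essentially the same sample.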

And that is where the survey raises questions – the survey has used an equal number of low risk, high risk and medium risk samples. “High risk” have been defined as people with comorbidities. Moderate risk are people who interact a lot with a lot of people (shopkeepers, healthcare workers, etc.). Both seem fine. It’s the “low risk” that seems suspect, where they have included pregnant women and attendants of outpatient patients in hospitals.

I have a few concerns – are the “low risk” low risk enough? Doesn’t the fact that you have accompanied someone to hospital, or gone to hospital yourself (because you are pregnant), make you higher than average risk? And then – there are an equal number of low risk, medium risk and high risk people in the sample and there doesn’t seem to be any re-weighting. This suggests to me that the medium and high risk people have been overrepresented in the sample.
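To illustrate what re-weighting would do, here is a toy sketch. Every number in it is HYPOTHETICAL: neither the per-group positivity rates nor the population shares come from the survey, and they are chosen only to show the direction of the effect when higher-risk groups are overrepresented.

```python
# HYPOTHETICAL numbers, for illustration only; none of these come
# from the Karnataka survey or its press release.
sample_rate = {"low": 0.10, "medium": 0.20, "high": 0.30}       # positivity in each group
population_share = {"low": 0.70, "medium": 0.20, "high": 0.10}  # share of the population

# With equal-sized groups, the unweighted estimate is a plain average
unweighted = sum(sample_rate.values()) / len(sample_rate)

# Re-weighted: each group counts in proportion to its population share
weighted = sum(sample_rate[g] * population_share[g] for g in sample_rate)

print(f"unweighted estimate: {unweighted:.0%}")
print(f"re-weighted estimate: {weighted:.0%}")
```

With these made-up inputs the unweighted figure is higher than the re-weighted one, which is the kind of overstatement I am worried about. How large the effect is in the real survey depends on the actual group rates and population shares.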

Finally, the press release says:

We excluded those already diagnosed with SARS-CoV2 infection, unwilling to provide a sample for the test, or did not agree to provide informed consent

I wonder if this sort of exclusion doesn’t result in a bias in itself.

Putting all this together – that there are equal samples of low, medium and high risk, that the “low risk” sample itself contains people of higher than normal risk, and that people who have refused to participate in the survey have been excluded – I sense that the total prevalence of covid-19 in Karnataka is likely to be overstated. By what factor, it is impossible to say. Maybe our original guess that the incidence of the disease is about 20 times the number of known cases is still valid? We will never know.

Nevertheless, we can be confident that a large section of the state (may not be 50%, but maybe 40%?) has already been infected with covid-19 and unless the ongoing festive season plays havoc, the number of cases is likely to continue dipping.

However, this is no reason to be complacent. I think Nitin Pai is bang on here.

And I know a lot of people who have been aggressively social distancing (not even meeting people who have domestic help coming home, etc.). It is important that when they do relax, they do so in a graded manner.

Wear masks. Avoid crowded closed places. If you are going to get covid-19 anyway (and many of us have already got it, whether we know it or not), it is significantly better for you that you get a small viral load of it.

Rare observations and observed distributions

Over the last four years, one of my most frequent commutes in Bangalore has been between Jayanagar and Rajajinagar – I travel between these two places once a week on an average. There are several routes one can take to get to Rajajinagar from Jayanagar, and one of them happens to be from the inside of Chamrajpet. However, I can count the number of times I’ve taken that route in the last four years on the fingers of one hand. This is because the first time I took that route I got stuck in a massive traffic jam.

Welcome to the world of real distributions and observed distributions. The basic concept is that if you observe a particular event rarely, the distribution you observe can be very different from the actual distribution. Take the above example of driving through inner Chamrajpet. Let us say that the average time to drive through that particular road on a Saturday evening is 10 minutes. Let us say that 99% of the time on a Saturday evening, you take less than 15 minutes to drive through that road. In the remaining 1% of the time, you can take as much as an hour to drive through the road.

Now, if you are a regular commuter who drives through this road every Saturday evening, you will be aware of the distribution. You will be aware that 99% of the time you will take at most 15 minutes to get past, and base your routing decision on that. When it takes an hour to drive past, you know that it is a rare event and discount it from your future calculations. If, however, you are an irregular commuter like me and happened to drive through that road on the one day when it took an hour to get past, you will assume that that is the average time it takes to get past! You are likely to mistake the rare event for the usual, and that can lead to suboptimal decisions in the future.

In his book The Black Swan, Nassim Nicholas Taleb talks about the inability of people to model for rare events. He says that the problem is that people underestimate the probability of rare events and fail to account for them in their models, leading to blow ups when they do occur. While I agree that is a problem, I contend that the opposite problem cannot be ignored either. Sometimes you fail to recognize that what has happened to you is a rare event and thus end up with a wrong model.

Let me illustrate both problems with the same example. Think of a game where 99 times out of 100 you win a rupee. The rest of the time (i.e. 1%) you lose fifty rupees. Regular players of the game, who have “sampled” this enough will know the full distribution, and will take that into account when deciding on whether to play the game. Non-regular players, however, don’t have complete information.

Let us say there are a hundred cards. 99 of them have a +1 written on them, and the 100th has a -50. Let us suppose you pick ten cards. Ninety percent of the time, all ten cards you pick will be a “+1”, and you will conclude that all cards are “+1”. You will model for the game to give you a rupee each time you play. The other 10% of the time, however, you will draw nine +1s and one -50. You will then assume that the expected value of playing the game is Rs. -4.1 ((9 * 1 + 1 * (-50))/10). Notice that both times you are wrong in your inference: the true expected value is Rs. 0.49 per play ((99 * 1 + 1 * (-50))/100)!
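The numbers in the card example can be verified exactly. Below is a small sketch, with variable names of my own choosing, that works out the true expected value of a card, the chance that a ten-card draw (without replacement) misses the -50 card, and the average the observer would see in each case.

```python
from fractions import Fraction
from math import comb

cards = [1] * 99 + [-50]  # 99 cards marked +1, one marked -50
draws = 10                # cards drawn without replacement

# True expected value of a single card
true_ev = Fraction(sum(cards), len(cards))

# Probability that the -50 card is absent from a 10-card draw:
# choose all 10 from the 99 "+1" cards, out of all 10-card draws
p_all_plus_one = Fraction(comb(99, draws), comb(100, draws))

# Sample mean the observer sees in each of the two possible cases
mean_without_loss = Fraction(draws * 1, draws)
mean_with_loss = Fraction((draws - 1) * 1 + (-50), draws)

print(f"true expected value per play: {float(true_ev):.2f}")
print(f"P(all ten are +1): {float(p_all_plus_one):.2f}")
print(f"observed mean, no -50 drawn: {float(mean_without_loss):.2f}")
print(f"observed mean, -50 drawn: {float(mean_with_loss):.2f}")
```

Neither observed mean equals the true expected value, so the occasional player is misled whichever way the draw goes.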

So while it is important that you recognize black swans, it is also important that you don’t overestimate their probability. Always remember that if you are a rare observer, the distribution you observe may not reflect the real distribution.

New York food recommendations

So I’m in New York for the next 2 weeks (6-18), and like last time want to do this sampling of high-quality cuisine from around the world. Meals are expense-able so cost isn’t so much of an issue, but given that I’ll be eating 13 dinners there I want to choose the places carefully.

Off the cuff, this is what I broadly want to eat. So plis to be recommending where to go:

high-quality thin crust pizza
pasta
hummus-falafel-…
mexican stuff – i’ll go to chipotle at least once for sure. any other places?
thai – something I missed out on on my last trip there.
ethiopian – considering I don’t get to eat this in India I want to eat this at least twice
french
Korean – I’m going to Hangawi once again for sure. Sheer awesomeness it is

Ok I’m sure there are other cuisines whose existence I’m not even aware of, so if you think there’s something I might like plis to be recommending. And if you live in NYC and want to meet for a meal, let me know. We’ll plan something.

And here is the list of places I went to during my last visit.

Update: I’ll be staying at the Hilton Millennium, next to the WTC site.