Unbundling Higher Education

In July 2004, I went to Madras, wore fancy clothes and collected a laminated piece of paper. The piece of paper, formally called “Bachelor of Technology”, certified three things.

First, it said that I had (very likely) got a very good rank in IIT JEE, which enabled me to enrol in the Computer Science B.Tech. program at IIT Madras.  Then, it certified that I had attended a certain number of lectures and laboratories (equivalent to “180 credits”) at IIT Madras. Finally, it certified that I had completed assignments and passed tests administered by IIT Madras to a sufficient degree that qualified me to get the piece of paper.

Note that all three of these were necessary conditions for my getting the degree from IIT Madras. Not passing IIT JEE with a fancy enough rank would have precluded me from the other two steps in the first place. Either not attending lectures and labs, or not doing the assignments and exams, would have meant that my “coursework would be incomplete”, leaving me ineligible for the degree.

In other words, my higher education was bundled. There is no reason that should be so.

There is no reason that a single entity should have been responsible for entry testing (which is what IIT-JEE essentially is), teaching and exit testing. Each of these three could have been done by an independent entity.

For example, you could have “credentialing entities” or “entry testing entities”, whose job is to test you on things that admissions departments of colleges might test you on. This could include subject tests such as IIT-JEE, or aptitude tests such as GRE, or even evaluations of extra-curricular activities, recommendation letters and essays as practiced in American universities.

Then, you could have “teaching entities”. This is like the MOOCs we already have. The job of these teaching entities is to teach a subject or set of subjects, and make sure you understand the material. Testing whether you have learnt the material, however, is not the job of the teaching entities. It is also likely that unless there are superstar teachers, the value these teaching entities can capture comes down, on account of marginal cost pricing, to close to zero.

To test whether you learnt your stuff, you have the testing entities. Their job is to test whether your level of knowledge is sufficient to certify that you have learnt a particular subject. It is quite possible that some testing entities might demand that you have cleared a particular cutoff on entry tests before you are eligible for their exit test certificates, but many others may not.

The only downside of this unbundling is that independent evaluation becomes difficult. What do you make of a person who has cleared entry tests  mandated by a certain set of institutions, and exit tests traditionally associated with a completely different set of institutions? Is the entry test certificate (and associated rank or percentile) enough to give you a particular credential or should it be associated with an exit test as well?

These complications are possibly why higher education hasn’t experimented with any such unbundling so far (though MOOCs have taken the teaching bit outside the traditional classroom).

However, there is an opportunity now. Covid-19 means that several universities have decided to hold online-only classes in 2020-21. Without the peer learning aspect, people are wondering if it is worth paying the traditional fees for these schools. People are also calling for top universities to expand their programs since the marginal cost of teaching is dropping further, with the backlash being that this will “dilute” the degrees.

This is where unbundling comes into play. Essentially anyone should be able to attend the Harvard lectures, and maybe even do the Harvard exams (if this can be done with a sufficiently low marginal cost). However, you get a Harvard degree if and only if you have cleared the traditional Harvard admission criteria (maybe the rest get a diploma or something?).

Others might decide, upon clearing the traditional Harvard admission criteria, that this credential itself is sufficient for them, and not bother getting the full degree. The possibilities are endless.

Old-time readers of this blog might remember that I had almost experimented with something like this. Highly disillusioned during my first year at IIT, I had considered dropping out, reasoning that my JEE rank was credential enough. Finally, I turned out to be too much of a chicken to do that.

What is the Case Fatality Rate of Covid-19 in India?

The economist in me will give a very simple answer to that question – it depends. It depends on how long you think people will take from onset of the disease to die.

The modeller in me extended the argument that the economist in me made, and built a rather complicated model. This involved smoothing, assumptions on probability distributions, long mathematical derivations and (for good measure) regressions. And out of all that came this graph, with the assumption that the average person who dies of covid-19 dies 20 days after the infection is detected.

[Graph: estimated case fatality rate by state, assuming death on average 20 days after detection]

Yes, there is a wide variation across the country. Given that the disease is the same and the treatment for most patients is pretty much the same (lots of rest, lots of water, etc), it is weird that the case fatality rate varies so much across Indian states. There is only one explanation – assuming that deaths can’t be faked or miscounted (covid deaths attributed to other causes or vice versa), the problem is in the “denominator” – the number of confirmed cases.

What the variation here tells us is that in states towards the top of this graph, we are likely not detecting most of the positive cases (serious cases will get themselves tested anyway, get hospitalised, and perhaps die; it’s the less serious cases that can “slip”). Taking a state low down in this graph as a “good tester” (say Andhra Pradesh), we can try to estimate the extent of under-detection of cases in each state.

Based on state-wise case tallies as of now (there might be some error since some states might have reported today’s numbers and some might not have), here are my estimates of the actual number of cases in each state, based on our calculations of the case fatality rate.

Yeah, Maharashtra alone should have crossed a million cases based on the number of people who have died there!

Now let’s get to the maths. It’s messy. First we look at the number of confirmed cases per day and number of deaths per day per state (data from here). Then we smooth the data and take 7-day trailing moving averages. This is to get rid of any reporting pile-ups.

Now comes the probability assumption – we assume that a proportion p of all the confirmed cases will die. We also assume an average number of days (N) to death for the people who are supposed to die (let’s call them Romeos?). They won’t all pop off exactly N days after we detect their infection. Instead, of all the Romeos who are not yet dead, a proportion \lambda dies each day.

My maths has become rather rusty over the years, but a derivation I made shows that \lambda = \frac{1}{N}. So if people are supposed to die in an average of 20 days, \frac{1}{20} of the Romeos will die on the first day after detection, \frac{19}{20} \cdot \frac{1}{20} on the second day, and so on.
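(A quick sketch of that derivation, for completeness: if each surviving Romeo dies on any given day with probability \lambda, the number of days from detection to death follows a geometric distribution, whose mean is

\sum_{k=1}^{\infty} k \lambda (1-\lambda)^{k-1} = \frac{1}{\lambda}

Setting this mean equal to N gives \lambda = \frac{1}{N}.)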

So the people who die today could be people who were detected with the infection yesterday, or the day before, or the day before day before (isn’t it weird that English doesn’t have a word for this?), or … Now, based on how many cases were detected on each day, and our assumption of p (let’s assume a value first; we can derive it back later), we can estimate how many of the people who were found sick k days ago are going to die today. Do this for all k, and you can model how many people will die today.

The equation will look something like this. Assume d_t is the number of people who die on day t and n_t is the number of cases confirmed on day t. We get

d_t = p  (\lambda n_{t-1} + (1-\lambda) \lambda n_{t-2} + (1-\lambda)^2 \lambda n_{t-3} + ... )

Now, all these ns are known. d_t is known. \lambda comes from our assumption of how long people will, on average, take to die once their infection has been detected. So in the above equation, everything except p is known.

And we have this data for multiple days. We know the left hand side. We know the value in brackets on the right hand side. All we need to do is to find p, which I did using a simple regression.
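As a sanity check of this approach (not the actual analysis, which runs on the real state-wise data), here is a minimal sketch in R that simulates daily case counts, generates deaths using a known p and \lambda, and then recovers p with the same no-intercept regression. All the numbers in it are made up for illustration.

set.seed(42)

# Made-up daily confirmed cases over 120 days (roughly exponential growth)
days <- 120
n <- round(50 * exp(0.04 * (1:days)))

# "True" parameters for the simulation
p_true <- 0.03      # true case fatality rate
N_days <- 20        # average days from detection to death
lambda <- 1 / N_days

# Expected deaths on day t: p * sum_k lambda * (1-lambda)^(k-1) * n_{t-k}
adjusted <- sapply(1:days, function(t) {
  if (t == 1) return(0)
  k <- 1:(t - 1)
  sum(lambda * (1 - lambda)^(k - 1) * n[t - k])
})
deaths <- p_true * adjusted + rnorm(days, sd = 2)  # add a little noise

# No-intercept regression of deaths on the weighted sum recovers p
fit <- lm(deaths ~ adjusted - 1)
coef(fit)  # should come out close to p_true = 0.03

The code at the end of this post does the same thing on the real (smoothed) data.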

And I did this for each state – take the number of confirmed cases on each day, the number of deaths on each day, and your assumption on the average number of days after detection that a person dies. And you can calculate p, which is the case fatality rate – the true proportion of cases that result in deaths.

This produced the first graph that I’ve presented above, for the assumption that a person, should he die, dies on an average 20 days after the infection is detected.

So what is India’s case fatality rate? While the first graph says it’s 5.8%, the variation by state suggests that this is really an issue of not detecting mild cases, so the true case fatality rate is likely far lower. From doing my daily updates on Twitter, I’ve come to trust Andhra Pradesh as a state that is testing well, so if we assume they’ve found all their active cases, we can use that as a base and arrive at the second graph in terms of the true number of cases in each state.

PS: It’s common to just divide the number of deaths so far by number of cases so far, but that is an inaccurate measure, since it doesn’t take into account the vintage of cases. Dividing deaths by number of cases as of a fixed point of time in the past is also inaccurate since it doesn’t take into account randomness (on when a Romeo might die).

Anyway, here is my code, for what it’s worth.

library(tidyverse) # dplyr and tidyr are used throughout; broom is called via ::

deathRate <- function(covid, avgDays) {
  # Reshape: one row per (State, Date) with Confirmed and Deceased counts
  covid %>%
    mutate(Date = as.Date(Date, '%d-%b-%y')) %>%
    gather(State, Number, -Date, -Status) %>%
    spread(Status, Number) %>%
    arrange(State, Date) ->
    cov1

  # Need to smooth everything by 7 days (trailing moving averages)
  cov1 %>%
    arrange(State, Date) %>%
    group_by(State) %>%
    mutate(
      TotalConfirmed = cumsum(Confirmed),
      TotalDeceased = cumsum(Deceased),
      ConfirmedMA = (TotalConfirmed - lag(TotalConfirmed, 7)) / 7,
      DeceasedMA = (TotalDeceased - lag(TotalDeceased, 7)) / 7
    ) %>%
    ungroup() %>%
    filter(!is.na(ConfirmedMA)) %>%
    select(State, Date, Deceased = DeceasedMA, Confirmed = ConfirmedMA) ->
    cov2

  # For each death date, build the lambda-weighted sum of earlier confirmed
  # cases (the term in brackets in the equation above), then regress observed
  # deaths on it without an intercept. The slope is the estimate of p.
  cov2 %>%
    select(DeathDate = Date, State, Deceased) %>%
    inner_join(
      cov2 %>%
        select(ConfirmDate = Date, State, Confirmed) %>%
        crossing(Delay = 1:100) %>%
        mutate(DeathDate = ConfirmDate + Delay),
      by = c("DeathDate", "State")
    ) %>%
    filter(DeathDate > ConfirmDate) %>%
    arrange(State, desc(DeathDate), desc(ConfirmDate)) %>%
    mutate(
      Lambda = 1 / avgDays,
      Adjusted = Confirmed * Lambda * (1 - Lambda)^(Delay - 1)
    ) %>%
    filter(Deceased > 0) %>%
    group_by(State, DeathDate, Deceased) %>%
    summarise(Adjusted = sum(Adjusted)) %>%
    ungroup() %>%
    lm(Deceased ~ Adjusted - 1, data = .) %>%
    broom::tidy() %>%
    pull(estimate) %>%
    first()
}
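Something like this is how it can be invoked. The file name and column layout below are assumptions based on the covid19india.org state-wise daily dump (Date, Status, one column of counts per state), so adjust as needed:

library(tidyverse)

# State-wise daily file downloaded from covid19india.org, with columns
# Date (like "23-Apr-20"), Status (Confirmed/Recovered/Deceased) and one
# column of counts per state. Exact format may vary; adjust accordingly.
covid <- read_csv("state_wise_daily.csv")

# Estimated case fatality rate, assuming death ~20 days after detection
deathRate(covid, avgDays = 20)

# Sensitivity check: what if Romeos take 15 or 25 days on average to die?
sapply(c(15, 20, 25), function(d) deathRate(covid, d))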

Tests per positive case

I seem to be becoming a sort of “testing expert”, though the so-called “testing mafia” (ok I only called them that) may disagree. Nothing external happened since the last time I wrote about this topic, but here is more “expertise” from my end.

As some of you might be aware, I’ve now created a script that does the daily updates that I’ve been doing on Twitter for the last few weeks. After I went off twitter last week, I tried for a couple of days to get friends to tweet my graphs. That wasn’t efficient. And I’m not yet over the twitter addiction enough to log in to twitter every day to post my daily updates.

So I’ve done what anyone who has a degree in computer science, and who has a reasonable degree of self-respect, should do – I now have this script (that runs on my server) that generates the graph and some mildly “intelligent” commentary and puts it out at 8am every day. Today’s update looked like this:

Sometimes I make the mistake of going to twitter and looking at the replies to these automated tweets (that can be done without logging in). Most replies seem to be from the testing mafia. “All this is fine but we’re not testing enough so can’t trust the data”, they say. And then someone goes off on “tests per million” as if that is some gold standard.

As I discussed in my last post on this topic, random testing is NOT a good thing here. There are several ethical issues with that. The error rates of the tests mean that there is a high chance of false positives, and also false negatives. So random testing can both “unleash” infected people, and unnecessarily clog hospital capacity with the uninfected.

So if random testing is not a good metric on how adequately we are testing, what is? One idea comes from this Yahoo report on covid management in Vietnam.

According to data published by Vietnam’s health ministry on Wednesday, Vietnam has carried out 180,067 tests and detected just 268 cases, 83% of whom it says have recovered. There have been no reported deaths.

The figures are equivalent to nearly 672 tests for every one detected case, according to the Our World in Data website. The next highest, Taiwan, has conducted 132.1 tests for every case, the data showed

Total tests per positive case. Now, that’s an interesting metric. The basic idea is that if most of the people we are testing show positive, then we simply aren’t testing enough. However, if we are testing a lot of people for every positive case, then it means that we are also testing a large number of marginal cases (there is one caveat I’ll come to).

Tests per positive case also takes the “base rate” into account. If a region has been affected massively, then the base rate itself will be high, and the region needs to test more. A less affected region needs less testing (remember we only test those with a high base rate). And it is likely that in a region with a higher base rate, more positive cases are found (this is a deadly disease, so anyone with more than a mild occurrence of it is bound to get themselves tested).

The only caveat here is that the tests need to be “of high quality”, i.e. they should be done on people with high base rates of having the disease. Any measure that becomes a target is bound to be gamed, so if tests per positive case becomes a target, it is easy for a region to game it by testing random people (rather than those with high base rates). For now, let’s assume that nobody has made this a target yet, so there isn’t that much gaming going on.

So how is India faring? Based on data from covid19india.org, India had done (as of yesterday, 23rd April) about 520,000 tests, of which about 23,000 have come back positive. In other words, India has tested about 23 people for every positive case. Compared to Vietnam (or even Taiwan) that’s a really low number.

However, different states are testing to different extents by this metric. Again using data from covid19india.org, I created this chart that shows the cumulative “tests per positive case” in each state in India. I drew each state in a separate graph, with different scales, because they were simply not comparable.

Notice that Maharashtra, our worst affected state, is only testing 14 people for every positive case, and this number is going down over time. Testing capacity in that state (which has, in absolute numbers, done the maximum number of tests) is sorely stretched, and it is imperative that testing be scaled up massively there. It seems highly likely that testing has been backlogged there, with not enough capacity to test the high base rate cases. Gujarat and Delhi, other badly affected states, are in similar boats, testing only 16 and 13 people (respectively) for every infected person.

At the other end, Orissa is doing well, testing 230 people for every positive case (and this number is rising). Karnataka is not bad either, with about 70 tests per case (again increasing; the state massively stepped up testing last Thursday). Andhra Pradesh is doing nearly 60. Haryana is doing 65.
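For the record, the computation behind that chart is simple once the data is in shape. A minimal sketch, assuming a data frame tests_df with one row per state per day and columns State, Date, TotalTested and TotalPositive (cumulative counts; the actual column names in the covid19india.org dump may differ):

library(tidyverse)

# Cumulative tests per positive case, by state over time
tests_df %>%
  mutate(TestsPerPositive = TotalTested / TotalPositive) %>%
  ggplot(aes(Date, TestsPerPositive)) +
  geom_line() +
  facet_wrap(~ State, scales = "free_y") +  # one panel per state, own scale
  labs(y = "Cumulative tests per positive case")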

Now I’m waiting for the usual suspects to reply to this (either on twitter, or as a comment on my blog) saying this doesn’t matter because we are “not doing enough tests per million”.

I wonder why some people are proud to show off their innumeracy (OK fine, I understand that it’s a bit harsh to describe someone who doesn’t understand Bayes’s Theorem as “innumerate”).


More on covid testing

There has been a massive jump in the number of covid-19 positive cases in Karnataka over the last couple of days. Today, there were 44 new cases discovered, and yesterday there were 36. This is a big jump from the average of about 15 cases per day in the preceding 4-5 days.

The good news is that not all of this is new infection. A lot of cases that have come out today are clusters of people who have collectively tested positive. However, there is one bit from yesterday’s cases (again a bunch of clusters) that stands out.

Source: covid19india.org

I guess by now everyone knows what “travelled from Delhi” is a euphemism for. The reason these cases are interesting to me is that they are based on a “repeat test”. In other words, all these people had tested negative the first time they were tested, and then they were tested again yesterday and found positive.

Why did they need a repeat test? That’s because the sensitivity of the Covid-19 test is rather low. Out of every 100 infected people who take the test, only about 70 are found positive (on average) by the test. That also depends upon when the sample is taken.  From the abstract of this paper:

Over the four days of infection prior to the typical time of symptom onset (day 5) the probability of a false negative test in an infected individual falls from 100% on day one (95% CI 69-100%) to 61% on day four (95% CI 18-98%), though there is considerable uncertainty in these numbers. On the day of symptom onset, the median false negative rate was 39% (95% CI 16-77%). This decreased to 26% (95% CI 18-34%) on day 8 (3 days after symptom onset), then began to rise again, from 27% (95% CI 20-34%) on day 9 to 61% (95% CI 54-67%) on day 21.

About one in three infected people (depending upon when you draw the sample) are found by the test to be uninfected. Maybe I should state it again: if you test a covid-19 positive person for covid-19, there is almost a one-third chance that she will be found negative.
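This is also why the repeat test in those clusters makes sense. If we make the (strong, and probably unrealistic) assumption that the two tests err independently, the chance that an infected person is missed by both is

(1 - 0.7) \times (1 - 0.7) = 0.09

so a second test brings the false negative rate down from roughly a third to under 10%.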

The good news (on the face of it) is that the test has “high specificity” of about 97-98% (this is from conversations I’ve had with people in the know; I’m unable to find links to corroborate this), or a false positive rate of 2-3%. That seems rather accurate, except that when the “prior probability” of having the disease is low, even this specificity is not good enough.

Let’s assume that a million Indians are covid-19 positive (the official numbers as of today are a little more than one-hundredth of that number). With one and a third billion people, that represents 0.075% of the population.

Let’s say we were to start “random testing” (as a number of commentators are advocating), and were to pull a random person off the street to test for Covid-19. The “prior” (before testing) likelihood she has Covid-19 is 0.075% (assume we don’t know anything more about her to change this assumption).

If we were to take 20000 such people, 15 of them will have the disease. The other 19985 don’t. Let’s test all 20000 of them.

Of the 15 who have the disease, the test returns “positive” for 10.5 (70% sensitivity; round up to 11). Of the 19985 who don’t have the disease, the test returns “positive” for 400 of them (let’s assume a specificity of 98%, or a false positive rate of 2%, placing more faith in the test)! In other words, if there were a million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 11/411 = 2.6%.

If there were 10 million covid-19 positive people in India (no harm in supposing), then the “base rate” would be 0.75%. So out of our sample of 20000, 150 would have the disease. Again testing all 20000, 105 of the 150 who have the disease would test positive, and 397 of the 19850 who don’t have the disease would test positive. In other words, if there were ten million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 105/(397+105) = 21%.

If there were ten million Covid-19 positive people in India, only one-fifth of the people who tested positive in a random test would actually have the disease.

Take a sip of water (ok I’m reading The Ken’s Beyond The First Order too much nowadays, it seems).
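If you want to play with these numbers, the arithmetic above is plain Bayes’s Theorem and fits in a few lines of R. A minimal sketch, with the sensitivity and specificity set to the assumed values from above:

# Probability of actually having the disease, given a positive test
posterior_positive <- function(prevalence, sensitivity = 0.7, specificity = 0.98) {
  true_positive  <- prevalence * sensitivity
  false_positive <- (1 - prevalence) * (1 - specificity)
  true_positive / (true_positive + false_positive)
}

posterior_positive(0.00075)  # ~1 million infected: about 2.6%
posterior_positive(0.0075)   # ~10 million infected: about 21%
posterior_positive(0.05)     # someone with a 5% prior: about 65%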

This is all standard maths stuff, and any self-respecting book or course on probability and Bayes’s Theorem will have at least a reference to AIDS or cancer testing. The story goes that this was a big deal in the 1990s when some people suggested that the AIDS test be used widely. Then, once this problem of false positives and posterior probabilities was pointed out, the strategy of only testing “high risk cases” got accepted.

And with a “low incidence” disease like covid-19, effective testing means you test people with a high prior probability. In India, that has meant testing people who travelled abroad, people who have come in contact with other known infected, healthcare workers, people who attended the Tablighi Jamaat conference in Delhi, and so on.

The advantage with testing people who already have a reasonable chance of having the disease is that once the test returns positive, you can be pretty sure they actually have the disease. It is more effective and efficient. Testing people with a “high prior probability of disease” is not discriminatory, or a “sampling bias” as some commentators alleged. It is prudent statistical practice.

Again, as I found to my own detriment with my tweetstorm on this topic the other day, people are bound to see politics and ascribe political motives to everything nowadays. In that sense, a lot of the commentary is not surprising. It’s also not surprising that when “one wing” heavily retweeted my article, “the other wing” made efforts to find holes in my argument (which, again, is textbook math).

One possibly apolitical criticism of my tweetstorm was that “the purpose of random testing is not to find out who is positive. It is to find out what proportion of the population has the disease”. The costs of this (apart from the monetary cost of actually testing) are threefold. Firstly, a large number of uninfected people will get hospitalised in covid-specific hospitals, clogging hospital capacity and increasing the chances that they get infected while in hospital.

Secondly, getting a truly random sample in this case is tricky, and possibly unethical. When you have limited testing capacity, you would be inclined (possibly morally, even) to use it on people who already have a high prior probability.

Finally, when the incidence is small, we need a really large sample to pin down the true incidence with any precision.

Let’s say 1 in 1000 Indians have the disease (or about 1.35 million people). Using the chi-square test of proportions, our estimate of the incidence of the disease varies significantly depending on how many people we test.

If we test a 1000 people and find 1 positive, the true incidence of the disease (95% confidence interval) could be anywhere from 0.01% to 0.65%.

If we test 10000 people and find 10 positive, the true incidence of the disease could be anywhere between 0.05% and 0.2%.

Only if we test 100000 people (a truly massive random sample) and find 100 positive does the true incidence narrow to between 0.08% and 0.12%, an acceptable range.
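These ranges come from the standard test of proportions. In R, prop.test() (which is chi-squared based) gives intervals of roughly this order, though the exact endpoints depend on the correction used:

# 95% confidence intervals for the true incidence, with the same observed
# 0.1% positive rate at three different sample sizes
prop.test(1, 1000)$conf.int       # very wide
prop.test(10, 10000)$conf.int     # narrower
prop.test(100, 100000)$conf.int   # narrow enough to be useful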

I admit that we may not be testing enough. A simple rule of thumb is that anyone with more than a 5% prior probability of having the disease needs to be tested. How we determine this prior probability is again dependent on some rules of thumb.

I’ll close by saying that we should NOT be doing random testing. That would be unethical on multiple counts.

Programming assignments and blind men and the elephant

Evaluating a tough programming assignment is like the story of the blind men and the elephant. Let me explain.

The assignments that I’ve handed out as part of my Spreadsheet Modelling for Business Decision Problems course at IIMB involve fairly complex spreadsheet modelling (as the name of the course suggests). Thus, while it is a lot of effort on the part of the student to do the assignment, it is also a lot of effort on my part if I have to go through the code line by line (these guys code using VBA macros), understand it and evaluate it.

Instead, I have come up with a set of “tests” – specific inputs that I give to the program (I’ve specified what the “front sheet” should look like, so this is easy), after which I see if the program gives out the desired outputs. Based on whether the outputs match, I dig a little deeper to see if they’ve done it right, and grade the assignment accordingly.

If the assignments that they’ve turned in are elephants, it’s too much of an effort for me to open my eyes and actually see that they are elephants. Hence, I feel around, and check for a few different components to make sure they’ve submitted elephants. So for this assignment I might check if the trunk is like a snake, and if so, they’ve passed. For another assignment, I might check if the legs are like trees, and if they are, pass them. And so forth.

Now, this is evidently not perfect. For example, if you know that I’ll only check for the trunk to be like a snake, you’ll just submit a trunk that’s like a snake rather than submitting a full assignment! But if you don’t know what I’m going to check for, it is possible that you’ve only submitted a snake, and when I look for tree-like legs and don’t find them, I give you a failing grade! There is a little bit of luck involved on both sides, but that’s how things work!

Extending this analogy to software testing, you can think of that too as an exercise of blind men learning about an elephant. The testers are the blind men of Indostan, trying to find out if the piece of code they’ve been given is an elephant. Each tester pokes around at a different part of the beast, trying to confirm if it fits what they’re looking for. And if the beast has a knife, a snake, a fan, a wall, a tree and a rope as part of it, it is declared as an elephant!

Speaking of software testing, I came across this brilliant video of a class in Hyderabad where software testing is being taught. Enjoy (HT: V Vinay).

https://www.youtube.com/watch?v=4Cir7_S6OSY

Calibration and test sets

When you’re doing any statistical analysis, the standard thing to do is to divide your data into “calibration” and “test” data sets. You build the model on the “calibration” data set, and then test it on the “test” data set. The purpose of this slightly complicated procedure is so that you don’t “overfit” your model.

Overfitting is the process where, in your attempt to find a superior model, you build a model that is too tailored to your data, and when you apply it to a different data set, it can fail spectacularly. By setting aside some of your data as a “test” data set, you make sure that the model you built is not too closely calibrated to the data you used to build it.

Now, there are several methods in which you can divide your data into “calibration” and “test” data sets. One method is to use a random number generator, and randomly divide the data into two parts – typically the calibration data set is about three times as big as the test data set (this is the rule I normally use, but there is no sanctity to this). The problem with this method, however, is that if you are building a model based on data collected at different points in time, any systematic change in behaviour over time cannot be captured by the model, and it loses predictive value. Let me explain.

Let us say that we are collecting some data over time. What data it is doesn’t matter, but essentially we are trying to use a set of variables to predict the value of another variable. Let us say that the relationship between the predictor variables and the predicted variable changes over time.

Now, if we were to build a model where we randomly divide data into calibration and test sets, the model we build will be something that takes into account the different regimes. The relationship between the predictor and predicted variables in the calibration data set is likely to be identical to the relationship in the test data set – since both have been sampled uniformly across time. While that might make the model look good on the test set, the problem is that this kind of model has little predictive value.

Another way of splitting your data into calibration and test period is by splitting it over time. Rather than using a random number generator to split data into calibration and test parts, we simply use time. We can say that the data collected in the first 3/4th of the time period (in which we’ve collected the data) forms the calibration set, and the last 1/4th forms the test set. A model tested on this kind of calibration and test data is a stronger model, for it has predictive value!

In real life, if you have to predict a variable in the future, all you have at your disposal is a model that is calibrated on past data. Thus, you need a model that works across time. And in order to make sure you model can work across time, what you need to do is to split your data into calibration and test sets across time – that way you can check that model built with data from one time period can indeed work on data from a following time period!

Finally, how can you check if there is a “regime change” in the relationship between the predictor and predicted variables? We can use the difference between these two ways of splitting the data into calibration and test sets!

Firstly, split the data into calibration and test sets randomly. Find out how well the model explains the data in the test set. Next, split the data into calibration and test sets by time. Now find out how well the model explains the data in the test set. If there is not much difference in the performance of the model on the test set in these two cases, it means that there is no “regime change”. If there is a significant difference between the performance of the two models, it means that there is a definite regime change. Moreover, the extent of regime change can be evaluated based on the difference in goodness of fit in the two cases.
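A minimal sketch of that comparison in R, using a placeholder data frame df with a Date column, a response y and predictors x1 and x2 (all these names, and the use of RMSE as the goodness-of-fit measure, are just illustrative choices):

library(dplyr)

set.seed(1)

rmse <- function(fit, newdata) {
  sqrt(mean((predict(fit, newdata) - newdata$y)^2))
}

# 1. Random split: roughly 75% calibration, 25% test
df_random <- df %>% mutate(is_test = runif(n()) < 0.25)
fit_random <- lm(y ~ x1 + x2, data = filter(df_random, !is_test))
rmse_random <- rmse(fit_random, filter(df_random, is_test))

# 2. Time split: first 3/4 of the period for calibration, last 1/4 for test
cutoff <- sort(df$Date)[ceiling(0.75 * nrow(df))]
fit_time <- lm(y ~ x1 + x2, data = filter(df, Date <= cutoff))
rmse_time <- rmse(fit_time, filter(df, Date > cutoff))

# If the time-split error is much worse than the random-split error, the
# relationship between predictors and predicted variable has likely shifted
c(random = rmse_random, time = rmse_time)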