## Modelling for accuracy

Recently I’ve been remembering the first assignment of my “quantitative methods 2” course at IIMB back in 2004. In the first part of that course, we were learning regression. And so this assignment involved a regression problem. Not too hard at first sight – maybe 3 explanatory variables.

We had been randomly divided into teams of four. I remember working on it in the Computer Centre, in close proximity to some other teams. I remember trying to “do gymnastics” – combining variables, transforming them, all in the hope of trying to get the “best possible R square”. From what I remember, most of the groups went “R square hunting” that day. The assignment had been cleverly chosen such that for an academic exercise, the R Square wasn’t very high.

As an aside – one thing a lot of people take a long time to come to terms with is that in “real life” (industry problems) R squares aren’t usually that high. Forecast accuracy isn’t that high. And that the elegant methods they had learnt back in school / academia may not be as elegant any more in industry. I think I’ve written about this, but I can’t find the link now.

Anyway, back to QM2. I remember the professor telling us that three groups would be chosen at random on the day of the assignment submission, and from each of these three groups one person would be chosen at random who would have to present the group’s solution to the class. I remember that the other three people in my group all decided to bunk class that day! In any case, our group wasn’t called to present.

The whole point of this massive build up is – our approach (and the approach of most other groups) had been all wrong. We had just gone in a mad hunt for R square, not bothering to figure out whether the wild transformations and combinations that we were making made any business sense. Moreover, in our mad hunt for R square, we had all forgotten to consider whether a particular variable was significant, and if the regression itself was significant.

What we learnt was that while R square matters, it is not everything. The “model needs to be good”. The variables need to make sense. In statistics you can’t just go about optimising for one metric – there are several others. And this lesson has stuck with me. And guides how I approach all kinds of data modelling work. And I realise that is in conflict with the way data science is widely practiced nowadays.

The way data science is largely practiced in the wild nowadays is precisely a mad hunt for R Square (or area under ROC curve, if you’re doing a classification problem). Whether the variables used make sense doesn’t matter. Whether the transformations are sound doesn’t matter. It doesn’t matter at all whether the model is “good”, or appropriate – the only measure of goodness of the model seems to be the R square!

In a way, contests such as Kaggle have exacerbated this trend. In contests, typically, there is a precise metric (such as R Square) that you are supposed to maximise. With contests being evaluated algorithmically, it is difficult to evaluate on multiple parameters – especially not whether “the model is good”. And since nowadays a lot of data scientists hone their skills by participating in contests such as on Kaggle, they are tuned to simply go R square hunting.

Also, the big difference between Kaggle and real life is that in Kaggle, the model that you build doesn’t matter. It’s just a combination. You get the best R square. You win. You take the prize. You go home.

You don’t need to worry about how the data for the model was collected. The model doesn’t have to be implemented. No business decisions need to be made based on the model. Contest done, model done.

Obviously that is not how things work in real life. Building the model is only one in a long series of steps in solving the business problem. And when you focus too much on just one thing – the model’s accuracy in the data that you have been given, a lot can be lost in the rest of the chain (including application of the model in future situations).

And in this way, by focussing on just a small portion of the entire data science process (model building), I think Kaggle (and other similar competition platforms) has actually done a massive disservice to data science itself.

Tailpiece

This is completely unrelated to the rest of the post, but too small to merit a post of its own.

Suppose you ask a software engineer to sort a few datasets. He goes about applying bubble sort, heap sort, quick sort, insertion sort and a whole host of other techniques. And then picks the one that sorted the given datasets fastest.

That’s precisely how it seems “data science” is practiced nowadays

## Communicating Numbers

Earlier this week I read this masterful blogpost on Andrew Gelman’s blog (though the post itself is not written by Andrew Gelman – it’s written by Phil Price) about communicating numbers.

Basically the way you communicate a number can give a lot more information “between the lines”. Take the example at the top of the article:

“At the New York Marathon, three of the five fastest runners were wearing our shoes.” I’m sure I’m not the first or last person to have realized that there’s more information there than it seems at first. For one thing, you can be sure that one of those three runners finished fifth: otherwise the ad would have said “three of the four fastest.” Also, it seems almost certain that the two fastest runners were not wearing the shoes, and indeed it probably wasn’t 1-3 or 2-3 either: “The two fastest” and “two of the three fastest” both seem better than “three of the top five.” The principle here is that if you’re trying to make the result sound as impressive as possible, an unintended consequence is that you’re revealing the upper limit.

Incredible. So 3 in 5 means one of them is likely to be 5th. And likely one is fourth as well. Similarly, if you see a company that calls itself a “Fortune 500 company”, it is likely closer to 500 than to 100.

The other, slightly unrelated, example quoted in the article is about Covid-19 spread in outdoor conditions. There is another article that says that “less than 10% of covid-19 transmission that happens indoors”. This is misleading because if you say “less than 10%”, people will assume it’s 9%! The number, apparently, is closer to 0.1%.

There are many more such examples that we encounter in real life. If you write on LinkedIn that you went to a “top 10 ranked B-school”, it means you DID NOT go to a “top 5 ranked B-school”.

Loosely related to this, I’ve got a bit irritated over the last year and a bit in terms of imprecise numerical reporting by the media (related to covid-19). I won’t provide links or quotes here, since what I can remember are mostly by one person and I don’t want to implicate her here (and it’s a systemic problem, not unique to her).

You see reports saying “20000 new cases in Karnataka. A majority of them are from Bangalore”. I’ve seen this kind of a report even when 90% of the cases have been from Bangalore, and that is misleading – when you say “majority”, you instinctively think of “50% + 1”. Another report said “as many as 10000 cases”. Now, the “as many as” phrasing makes it sound like a very large number, but put in context, this 10000 wasn’t really very high.

Communication of numbers is an art that is not very well spread. Nowadays we see lots of courses on “telling stories with data”, “data visualisation”, graphics, etc. but none in terms of communication of sheer numbers itself.

Maybe I should record an episode about this in my forthcoming podcast. If you know who might be a good guest for it, AND can make an introduction, let me know.

## Communicating binary forecasts

One silver lining in the madness of the US Presidential election counting is that there are some interesting analyses floating around regarding polling and surveying and probabilities and visualisation. Take this post from Andrew Gelman’s blog, for example:

Suppose our forecast in a certain state is that candidate X will win 0.52 of the two-party vote, with a forecast standard deviation of 0.02. Suppose also that the forecast has a normal distribution.[…]

Then your 68% predictive interval for the candidate’s vote share is [0.50, 0.54], and your 95% interval is [0.48, 0.56].

Now suppose the candidate gets exactly half of the vote. Or you could say 0.499, the point being that he lost the election in that state.

This outcome falls on the boundary of the 68% interval, it’s one standard deviation away from the forecast. In no sense would this be called a prediction error or a forecast failure.

But now let’s say it another way. The forecast gave the candidate an 84% chance of winning! And then he lost. That’s pretty damn humiliating. The forecast failed.

It took me a while to appreciate this. In a binary outcome, if your model says predicts 52%, with a standard deviation of 2%, you are in effect predicting a “win” (50% or higher) with a probability of 84%! Somehow I had never thought about it that way.

In any case, this tells you how tricky forecasting a binary outcome is. You might think (based on your sample size) that a 2% standard deviation is reasonable. Except that when the mean of your forecast is close to the barrier (50% in this case), the “reasonable standard deviation” lends a much stronger meaning to your forecast.

Gelman goes on:

That’s right. A forecast of 0.52 +/- 0.02 gives you an 84% chance of winning.

We want to increase the sd in the above expression so as to send the win probability down to 60%. How much do we need to increase it? Maybe send it from 0.02 to 0.03?

> pnorm(0.52, 0.50, 0.03)
[1] 0.75


Uh, no, that wasn’t enough! 0.04?

> pnorm(0.52, 0.50, 0.04)
[1] 0.69


0.05 won’t do it either. We actually have to go all the way up to . . . 0.08:

> pnorm(0.52, 0.50, 0.08)
[1] 0.60


That’s right. If your best guess is that candidate X will receive 0.52 of the vote, and you want your forecast to give him a 60% chance of winning the election, you’ll have to ramp up the sd to 0.08, so that your 95% forecast interval is a ridiculously wide 0.52 +/- 2*0.08, or [0.36, 0.68].

Who said forecasting an election is easy?

## What is the Case Fatality Rate of Covid-19 in India?

The economist in me will give a very simple answer to that question – it depends. It depends on how long you think people will take from onset of the disease to die.

The modeller in me extended the argument that the economist in me made, and built a rather complicated model. This involved smoothing, assumptions on probability distributions, long mathematical derivations and (for good measure) regressions.. And out of all that came this graph, with the assumption that the average person who dies of covid-19 dies 20 days after the thing is detected.

Yes, there is a wide variation across the country. Given that the disease is the same and the treatment for most people diseased is pretty much the same (lots of rest, lots of water, etc), it is weird that the case fatality rate varies by so much across Indian states. There is only one explanation – assuming that deaths can’t be faked or miscounted (covid deaths attributed to other reasons or vice versa), the problem is in the “denominator” – the number of confirmed cases.

What the variation here tells us is that in states towards the top of this graph, we are likely not detecting most of the positive cases (serious cases will get themselves tested anyway, and get hospitalised, and perhaps die. It’s the less serious cases that can “slip”). Taking a state low down below in this graph as a “good tester” (say Andhra Pradesh), we can try and estimate what the extent of under-detection of cases in each state is.

Based on state-wise case tallies as of now (might be some error since some states might have reported today’s number and some mgiht not have), here are my predictions on how many actual number of confirmed cases there are per state, based on our calculations of case fatality rate.

Yeah, Maharashtra alone should have crossed a million caess based on the number of people who have died there!

Now let’s get to the maths. It’s messy. First we look at the number of confirmed cases per day and number of deaths per day per state (data from here). Then we smooth the data and take 7-day trailing moving averages. This is to get rid of any reporting pile-ups.

Now comes the probability assumption – we assume that a proportion $p$ of all the confirmed cases will die. We assume an average number of days ($N$) to death for people who are supposed to die (let’s call them Romeos?). They all won’t pop off exactly $N$ days after we detect their infection. Let’s say a proportion $\lambda$ dies each day. Of everyone who is infected, supposed to die and not yet dead, a proportion $\lambda$ will die each day.

My maths has become rather rusty over the years but a derivation I made shows that $\lambda = \frac{1}{N}$. So if people are supposed to die in an average of 20 days, $\frac{1}{20}$ will die today, $\frac{19}{20}\frac{1}{20}$ will die tomorrow. And so on.

So people who die today could be people who were detected with the infection yesterday, or the day before, or the day before day before (isn’t it weird that English doesn’t a word for this?) or … Now, based on how many cases were detected on each day, and our assumption of $p$ (let’s assume a value first. We can derive it back later), we can know how many people who were found sick $k$ days back are going to die today. Do this for all $k$, and you can model how many people will die today.

The equation will look something like this. Assume $d_t$ is the number of people who die on day $t$ and $n_t$ is the number of cases confirmed on day $t$. We get

$d_t = p (\lambda n_{t-1} + (1-\lambda) \lambda n_{t-2} + (1-\lambda)^2 \lambda n_{t-3} + ... )$

Now, all these $n$s are known. $d_t$ is known. $\lambda$ comes from our assumption of how long people will, on average, take to die once their infection has been detected. So in the above equation, everything except $p$ is known.

And we have this data for multiple days. We know the left hand side. We know the value in brackets on the right hand side. All we need to do is to find $p$, which I did using a simple regression.

And I did this for each state – take the number of confirmed cases on each day, the number of deaths on each day and your assumption on average number of days after detection that a person dies. And you can calculate $p$, which is the case fatality rate. The true proportion of cases that are resulting in deaths.

This produced the first graph that I’ve presented above, for the assumption that a person, should he die, dies on an average 20 days after the infection is detected.

So what is India’s case fatality rate? While the first graph says it’s 5.8%, the variations by state suggest that it’s a mild case detection issue, so the true case fatality rate is likely far lower. From doing my daily updates on Twitter, I’ve come to trust Andhra Pradesh as a state that is testing well, so if we assume they’ve found all their active cases, we use that as a base and arrive at the second graph in terms of the true number of cases in each state.

PS: It’s common to just divide the number of deaths so far by number of cases so far, but that is an inaccurate measure, since it doesn’t take into account the vintage of cases. Dividing deaths by number of cases as of a fixed point of time in the past is also inaccurate since it doesn’t take into account randomness (on when a Romeo might die).

Anyway, here is my code, for what it’s worth.

deathRate <- function(covid, avgDays) {
covid %>%
mutate(Date=as.Date(Date, '%d-%b-%y')) %>%
gather(State, Number, -Date, -Status) %>%
arrange(State, Date) ->
cov1

# Need to smooth everything by 7 days
cov1 %>%
arrange(State, Date) %>%
group_by(State) %>%
mutate(
TotalConfirmed=cumsum(Confirmed),
TotalDeceased=cumsum(Deceased),
ConfirmedMA=(TotalConfirmed-lag(TotalConfirmed, 7))/7,
DeceasedMA=(TotalDeceased-lag(TotalDeceased, 7))/ 7
) %>%
ungroup() %>%
filter(!is.na(ConfirmedMA)) %>%
select(State, Date, Deceased=DeceasedMA, Confirmed=ConfirmedMA) ->
cov2

cov2 %>%
select(DeathDate=Date, State, Deceased) %>%
inner_join(
cov2 %>%
select(ConfirmDate=Date, State, Confirmed) %>%
crossing(Delay=1:100) %>%
mutate(DeathDate=ConfirmDate+Delay),
by = c("DeathDate", "State")
) %>%
filter(DeathDate > ConfirmDate) %>%
arrange(State, desc(DeathDate), desc(ConfirmDate)) %>%
mutate(
Lambda=1/avgDays,
) %>%
filter(Deceased > 0) %>%
group_by(State, DeathDate, Deceased) %>%
ungroup() %>%
summary() %>%
broom::tidy() %>%
select(estimate) %>%
first() %>%
return()
}

## More on covid testing

There has been a massive jump in the number of covid-19 positive cases in Karnataka over the last couple of days. Today, there were 44 new cases discovered, and yesterday there were 36. This is a big jump from the average of about 15 cases per day in the preceding 4-5 days.

The good news is that not all of this is new infection. A lot of cases that have come out today are clusters of people who have collectively tested positive. However, there is one bit from yesterday’s cases (again a bunch of clusters) that stands out.

I guess by now everyone knows what “travelled from Delhi” is a euphemism for. The reason they are interesting to me is that they are based on a “repeat test”. In other words, all these people had tested negative the first time they were tested, and then they were tested again yesterday and found positive.

Why did they need a repeat test? That’s because the sensitivity of the Covid-19 test is rather low. Out of every 100 infected people who take the test, only about 70 are found positive (on average) by the test. That also depends upon when the sample is taken.  From the abstract of this paper:

Over the four days of infection prior to the typical time of symptom onset (day 5) the probability of a false negative test in an infected individual falls from 100% on day one (95% CI 69-100%) to 61% on day four (95% CI 18-98%), though there is considerable uncertainty in these numbers. On the day of symptom onset, the median false negative rate was 39% (95% CI 16-77%). This decreased to 26% (95% CI 18-34%) on day 8 (3 days after symptom onset), then began to rise again, from 27% (95% CI 20-34%) on day 9 to 61% (95% CI 54-67%) on day 21.

About one in three (depending upon when you draw the sample) infected people who have the disease are found by the test to be uninfected. Maybe I should state it again. If you test a covid-19 positive person for covid-19, there is almost a one-third chance that she will be found negative.

The good news (at the face of it) is that the test has “high specificity” of about 97-98% (this is from conversations I’ve had with people in the know. I’m unable to find links to corroborate this), or a false positive rate of 2-3%. That seems rather accurate, except that when the “prior probability” of having the disease is low, even this specificity is not good enough.

Let’s assume that a million Indians are covid-19 positive (the official numbers as of today are a little more than one-hundredth of that number). With one and a third billion people, that represents 0.075% of the population.

Let’s say we were to start “random testing” (as a number of commentators are advocating), and were to pull a random person off the street to test for Covid-19. The “prior” (before testing) likelihood she has Covid-19 is 0.075% (assume we don’t know anything more about her to change this assumption).

If we were to take 20000 such people, 15 of them will have the disease. The other 19985 don’t. Let’s test all 20000 of them.

Of the 15 who have the disease, the test returns “positive” for 10.5 (70% accuracy, round up to 11). Of the 19985 who don’t have the disease, the test returns “positive” for 400 of them (let’s assume a specificity of 98% (or a false positive rate of 2%), placing more faith in the test)! In other words, if there were a million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 11/411 = 2.6%.

If there were 10 million covid-19 positive people in India (no harm in supposing), then the “base rate” would be .75%. So out of our sample of 20000, 150 would have the disease. Again testing all 20000, 105 of the 150 who have the disease would test positive. 397 of the 19850 who don’t have the disease will test positive. In other words, if there were ten million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 105/(397+105) = 21%.

###### If there were ten million Covid-19 positive people in India, only one-fifth of the people who tested positive in a random test would actually have the disease.

Take a sip of water (ok I’m reading The Ken’s Beyond The First Order too much nowadays, it seems).

This is all standard maths stuff, and any self-respecting book or course on probability and Bayes’s Theorem will have at least a reference to AIDS or cancer testing. The story goes that this was a big deal in the 1990s when some people suggested that the AIDS test be used widely. Then, once this problem of false positives and posterior probabilities was pointed out, the strategy of only testing “high risk cases” got accepted.

And with a “low incidence” disease like covid-19, effective testing means you test people with a high prior probability. In India, that has meant testing people who travelled abroad, people who have come in contact with other known infected, healthcare workers, people who attended the Tablighi Jamaat conference in Delhi, and so on.

The advantage with testing people who already have a reasonable chance of having the disease is that once the test returns positive, you can be pretty sure they actually have the disease. It is more effective and efficient. Testing people with a “high prior probability of disease” is not discriminatory, or a “sampling bias” as some commentators alleged. It is prudent statistical practice.

Again, as I found to my own detriment with my tweetstorm on this topic the other day, people are bound to see politics and ascribe political motives to everything nowadays. In that sense, a lot of the commentary is not surprising. It’s also not surprising that when “one wing” heavily retweeted my article, “the other wing” made efforts to find holes in my argument (which, again, is textbook math).

One possibly apolitical criticism of my tweetstorm was that “the purpose of random testing is not to find out who is positive. It is to find out what proportion of the population has the disease”. The cost of this (apart from the monetary cost of actually testing) are threefold. Firstly, a large number of uninfected people will get hospitalised in covid-specific hospitals, clogging hospital capacity and increasing the chances that they get infected while in hospital.

Secondly, getting a truly random sample in this case is tricky, and possibly unethical. When you have limited testing capacity, you would be inclined (possibly morally, even) to use it on people who already have a high prior probability.

Finally, when the incidence is small, we need a really large sample to find out the true range.

Let’s say 1 in 1000 Indians have the disease (or about 1.35 million people). Using the Chi Square test of proportions, our estimate of the incidence of the disease varies significantly on how many people are tested.

If we test a 1000 people and find 1 positive, the true incidence of the disease (95% confidence interval) could be anywhere from 0.01% to 0.65%.

If we test 10000 people and find 10 positive, the true incidence of the disease could be anywhere between 0.05% and 0.2%.

Only if we test 100000 people (a truly massive random sample) and find 100 positive, then the true incidence lies between 0.08% and 0.12%, an acceptable range.

I admit that we may not be testing enough. A simple rule of thumb is that anyone with more than a 5% prior probability of having the disease needs to be tested. How we determine this prior probability is again dependent on some rules of thumb.

I’ll close by saying that we should NOT be doing random testing. That would be unethical on multiple counts.

## Surveying Income

For a long time now, I’ve been sceptical of the practice of finding out the average income in a country or state or city or locality by doing a random survey. The argument I’ve made is “whether you keep Mukesh Ambani in the sample or not makes a huge difference in your estimate”. So far, though, I hadn’t been able to make a proper mathematical argument.

In the course of writing a piece for Bloomberg Quint (my first for that publication), I figured out a precise mathematical argument. Basically, incomes are distributed according to a power law distribution, and the exponent of the power law means that variance is not defined. And hence the Central Limit Theorem isn’t applicable.

OK let me explain that in English. The reason sample surveys work is due to a result known as the Central Limit Theorem. This states that for a distribution with finite mean and variance, the average of a random sample of data points is not very far from the average of the population, and the difference follows a normal distribution with zero mean and variance that is inversely proportional to the number of points surveyed.

So if you want to find out the average height of the population of adults in an area, you can simply take a random sample, find out their heights and you can estimate the distribution of the average height of people in that area. It is similar with voting intention – as long as the sample of people you survey is random (and without bias), the average of their voting intention can tell you with high confidence the voting intention of the population.

This, however, doesn’t work for income. Based on data from the Indian Income Tax department, I could confirm (what theory states) that income in India follows a power law distribution. As I wrote in my piece:

The basic feature of a power law distribution is that it is self-similar – where a part of the distribution looks like the entire distribution.

Based on the income tax returns data, the number of taxpayers earning more than Rs 50 lakh is 40 times the number of taxpayers earning over Rs 5 crore.
The ratio of the number of people earning more than Rs 1 crore to the number of people earning over Rs 10 crore is 38.
About 36 times as many people earn more than Rs 5 crore as do people earning more than Rs 50 crore.

In other words, if you increase the income limit by a factor of 10, the number of people who earn over that limit falls by a factor between 35 and 40. This translates to a power law exponent between 1.55 and 1.6 (log 35 to base 10 and log 40 to base 10 respectively).

Now power laws have a quirk – their mean and variance are not always defined. If the exponent of the power law is less than 1, the mean is not defined. If the exponent is less than 2, then the distribution doesn’t have a defined variance. So in this case, with an exponent around 1.6, the distribution of income in India has a well-defined mean but no well-defined variance.

To recall, the central limit theorem states that the population mean follows a normal distribution with the mean centred at the sample mean, and a variance of $\frac{\sigma^2}{n}$ where $\sigma$ is the standard deviation of the underlying distribution. And when the underlying distribution itself is a power law distribution with an exponent less than 2 (as the case is in India), $\sigma$ itself is not defined.

Which means the distribution of population mean around sample mean has infinite variance. Which means the sample mean tells you absolutely nothing!

And hence, surveying is not a good way to find the average income of a population.

## Bayesian Reasoning and Indian Philosophy

I’m currently reading a book called How the World Thinks: A global history of philosophy by Julian Baggini. I must admit I bought this by mistake – I was at a bookshop where I saw this book and went to the Amazon website to check reviews. And by mistake I ended up hitting buy. And before I got around to returning it, I started reading and liking it, so I decided to keep it.

In any case, this book is a nice comparative history of world philosophies, with considerable focus on Indian, Chinese, Japanese and Islamic philosophies. The author himself is trained in European/Western philosophy, but he keeps an open mind and so far it’s been an engaging read.

Rather than approaching the topic in chronological order, like some historians might have been tempted to do, this book approaches it by concept, comparing how different philosophies treat the same concept. And the description of Indian philosophy in the “Logic” chapter caught my eye, in the sense that it reminded me of Bayesian logic, and a piece I’d written a few years back.

Talking about Hindu philosophy and logic, Baggins writes:

For instance, the Veda affirms that when the appropriate sacrifice for the sake of a son is performed, a son will be produced. But it is often observed that a son is not produced, even though the sacrifice has been performed. This would seem to be pretty conclusive proof that the sacrifices don’t work and so the Veda is flawed. Not, however, if you start from the assumption that the Veda cannot be flawed.

In other words, Hindu Philosophy starts with the Bayesian prior that the Veda cannot be flawed. Consequently, irrespective of how strong the empirical evidence that the Vedas are flawed, the belief in the Vedas can never change! On the other hand, if the prior probability that the Vedas were flawed were positive but even infinitesimal, then the amount of evidences such as the above (where sacrifices that are supposed to have produced sons but fail to do so) would over time result in the probability of the Vedas being flawed increasing, and soon tending to 1.

In 2015, I had written in Mint about how Bayesian logic can be used to explain online flame wars. There again, I had written about how when people start with extreme opinions (probabilities equal to 0 or 1), even the strongest contrary evidence is futile to get them to change their opinions. And hence in online flame wars you have people simply talking past each other because neither is willing to update their opinions in the face of evidence.

Coming back to Hindu philosophy, this prior belief that the Vedas cannot be flawed reminds me of the numerous futile arguments with some of my relatives who are of a rather religious persuasion. In each case I presented to them what seemed like strong proof that some of their assumptions of religion are flawed. In each case, irrespective of the strength of my evidence, they refused to heed my argument. Now, looking at the prior of a religious Hindu – that the likelihood of the Vedas being flawed is 0 (not infinitesimal, but 0), it is clear why my arguments fell on deaf ears.

In any case, Baggini goes on to say:

By this logic, if ‘a son is sure to be produced as a result of performing the sacrifice’ but a son is not produced, it can only follow that the sacrifice was not performed correctly, however much it seems that it was performed properly. By such argument, the Ny?ya S?tra can safely conclude, ‘Therefore there is no untruth in the Veda.’

## Hypothesis Testing in Monte Carlo

I find it incredible, and not in a good way, that I took fourteen years to make the connection between two concepts I learnt barely a year apart.

In August-September 2003, I was auditing an advanced (graduate) course on Advanced Algorithms, where we learnt about randomised algorithms (I soon stopped auditing since the maths got heavy). And one important class of randomised algorithms is what is known as “Monte Carlo Algorithms”. Not to be confused with Monte Carlo Simulations, these are randomised algorithms that give a one way result. So, using the most prominent example of such an algorithm, you can ask “is this number prime?” and the answer to that can be either “maybe” or “no”.

The randomised algorithm can never conclusively answer “yes” to the primality question. If the algorithm can find a prime factor of the number, it answers “no” (this is conclusive). Otherwise it returns “maybe”. So the way you “conclude” that a number is prime is by running the test a large number of times. Each run reduces the probability that it is a “no” (since they’re all independent evaluations of “maybe”), and when the probability of “no” is low enough, you “think” it’s a “yes”. You might like this old post of mine regarding Monte Carlo algorithms in the context of romantic relationships.

Less than a year later, in July 2004, as part of a basic course in statistics, I learnt about hypothesis testing. Now (I’m kicking myself for failing to see the similarity then), the main principle of hypothesis testing is that you can never “accept a hypothesis”. You either reject a hypothesis or “fail to reject” it.  And if you fail to reject a hypothesis with a certain high probability (basically with more data, which implies more independent evaluations that don’t say “reject”), you will start thinking about “accept”.

Basically hypothesis testing is a one-sided  test, where you are trying to reject a hypothesis. And not being able to reject a hypothesis doesn’t mean we necessarily accept it – there is still the chance of going wrong if we were to accept it (this is where we get into messy territory such as p-values). And this is exactly like Monte Carlo algorithms – one-sided algorithms where we can only conclusively take a decision one way.

So I was thinking of these concepts when I came across this headline in ESPNCricinfo yesterday that said “Rahul Johri not found guilty” (not linking since Cricinfo has since changed the headline). The choice, or rather ordering, of words was interesting. “Not found guilty”, it said, rather than the usual “found not guilty”.

This is again a concept of one-sided testing. An investigation can either find someone guilty or it fails to do so, and the heading in this case suggested that the latter had happened. And as a deliberate choice, it became apparent why the headline was constructed this way – later it emerged that the decision to clear Rahul Johri of sexual harassment charges was a contentious one.

In most cases, when someone is “found not guilty” following an investigation, it usually suggests that the evidence on hand was enough to say that the chance of the person being guilty was rather low. The phrase “not found guilty”, on the other hand, says that one test failed to reject the hypothesis, but it didn’t have sufficient confidence to clear the person of guilt.

So due credit to the Cricinfo copywriters, and due debit to the product managers for later changing the headline rather than putting a fresh follow-up piece.

PS: The discussion following my tweet on the topic threw up one very interesting insight – such as Scotland having had a “not proven” verdict in the past for such cases (you can trust DD for coming up with such gems).

## Dimensional analysis in stochastic finance

Yesterday I was reading through Ole Peters’s lecture notes on ergodicity, a topic that I got interested in thanks to my extensive use of Utility Theory in my work nowadays. And I had a revelation – that in standard stochastic finance, mean returns and standard deviation of returns don’t have the same dimensions. Instead, it’s mean returns and the variance of returns that have the same dimensions.

While this might sound counterintuitive, it is not hard to see if you think about it analytically. We will start with what is possibly the most basic equation in stochastic finance, which is the lognormal random walk model of stock prices.

$dS = \mu S dt + \sigma S dW$

This can be rewritten as

$\frac{dS}{S} = \mu dt + \sigma dW$

Now, let us look at dimensions. The LHS divides change in stock price by stock price, and is hence dimensionless. So the RHS needs to be dimensionless as well if the equation is to make sense.

It is easy to see that the first term on the RHS is dimensionless – $\mu$, the average returns or the drift, is defined as “returns per unit time”. So a stock that returns, on average, 10% in a year returns 20% in two years. So returns has dimensions $t^{-1}$, and multiplying it with $dt$ which has the unit of time renders it dimensionless.

That leaves us with the last term. $dW$ is the Wiener Process, and is defined such that $dW^2 = dt$. This implies that $dW$ has the dimensions $\sqrt{t}$. This means that the equation is meaningful if and only if $\sigma$ has dimensions $t^{-\frac{1}{2}}$, which is the same as saying that $\sigma^2$ has dimensions $\frac{1}{t}$, which is the same as the dimensions of the mean returns.

It is not hard to convince yourself that it makes intuitive sense as well. The basic assumption of a random walk is that the variance grows linearly with time (another way of seeing this is that when you add two uncorrelated random variables, their variances add up to give the variance of the sum). From this again, variance has the units of inverse time – the same as the mean.

Finally, speaking of dimensional analysis and Ole Peters, check out his proof of the Pythagoras Theorem using dimensional analysis.

Isn’t it beautiful?

PS: Speaking of dimensional analysis, check out my recent post on stocks and flows and financial ratios.

## Machine learning and degrees of freedom

For starters, machine learning is not magic. It might appear like magic when you see Google Photos automatically tagging all your family members correctly, down to the day of their birth. It might appear so when Siri or Alexa give a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I’d made a detailed argument about that in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring to use “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted in a manner that the formula output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied on unknown data (I’m completely simplifying it here).

The genius in machine learning comes in setting up mathematical formulae in a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, which has been the rage this decade, for example, comes from a 30-year old mathematical breakthrough called “back propagation”. The reason it took until a few years back for it to become a “thing” has to do with data availability, and compute power (check this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, has thousands of parameters.

Because a regression has so few parameters, the kind of patterns that the system can detect is rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship that exists is simple (and linear), or when you are trying to force-fit a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie down the variables. If your data is small, then you can only afford a small number of parameters, and that limits the complexity of patterns that can be detected.

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning“, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

Thinking about it, after all, machine learning systems are not human!