For a long time now, I’ve been sceptical of the practice of finding out the average income in a country or state or city or locality by doing a random survey. The argument I’ve made is “whether you keep Mukesh Ambani in the sample or not makes a huge difference in your estimate”. So far, though, I hadn’t been able to make a proper mathematical argument.
In the course of writing a piece for Bloomberg Quint (my first for that publication), I figured out a precise mathematical argument. Basically, incomes are distributed according to a power law distribution, and the exponent of the power law means that variance is not defined. And hence the Central Limit Theorem isn’t applicable.
OK let me explain that in English. Sample surveys work because of a result known as the Central Limit Theorem. This states that for a distribution with finite mean and variance, the average of a large enough random sample of data points is close to the average of the population, and the difference approximately follows a normal distribution with zero mean and variance that is inversely proportional to the number of points surveyed.
So if you want to find out the average height of the population of adults in an area, you can simply take a random sample, find out their heights and you can estimate the distribution of the average height of people in that area. It is similar with voting intention – as long as the sample of people you survey is random (and without bias), the average of their voting intention can tell you with high confidence the voting intention of the population.
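Here is a rough sketch of that in Python, using only the standard library. The heights are made up (normal, with a mean of 165 cm and a standard deviation of 10 cm); those numbers are my assumption for illustration, not survey data. The point is only that the spread of the survey average shrinks as the sample grows.

```python
import random
import statistics

def sample_mean_spread(sample_size, trials=2000, seed=0):
    """Standard deviation of the sample mean across many repeated surveys."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Hypothetical heights in cm: normal, mean 165, standard deviation 10
        sample = [rng.gauss(165, 10) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

spread_100 = sample_mean_spread(100)
spread_400 = sample_mean_spread(400)
print("spread of the average, 100 people surveyed:", round(spread_100, 2))
print("spread of the average, 400 people surveyed:", round(spread_400, 2))
```

Surveying four times as many people should roughly halve the spread, which is what a variance inversely proportional to the sample size implies.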
This, however, doesn’t work for income. Based on data from the Indian Income Tax department, I could confirm (what theory states) that income in India follows a power law distribution. As I wrote in my piece:
The basic feature of a power law distribution is that it is self-similar – where a part of the distribution looks like the entire distribution.
Based on the income tax returns data, the number of taxpayers earning more than Rs 50 lakh is 40 times the number of taxpayers earning over Rs 5 crore.
The ratio of the number of people earning more than Rs 1 crore to the number of people earning over Rs 10 crore is 38.
About 36 times as many people earn more than Rs 5 crore as do people earning more than Rs 50 crore.
In other words, if you increase the income limit by a factor of 10, the number of people who earn over that limit falls by a factor between 35 and 40. This translates to a power law exponent between 1.55 and 1.6 (log 35 to base 10 and log 40 to base 10 respectively).
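The arithmetic is easy to check. If the number of people earning more than x falls off as x to the power of minus the exponent, then raising the income limit tenfold divides the count by 10 to the power of the exponent, so the exponent is just the base-10 log of the ratio. A small sketch using the three ratios quoted above (the dictionary labels are mine):

```python
import math

# Ratio of the number of people above an income limit
# to the number above a limit 10 times higher
ratios = {
    "50 lakh vs 5 crore": 40,
    "1 crore vs 10 crore": 38,
    "5 crore vs 50 crore": 36,
}

# Exponent implied by each ratio: alpha = log10(ratio)
exponents = {label: math.log10(ratio) for label, ratio in ratios.items()}

for label, alpha in exponents.items():
    print(f"{label}: exponent of about {alpha:.2f}")
```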
Now power laws have a quirk – their mean and variance are not always defined. If the exponent of the power law is 1 or less, the mean is not defined. If the exponent is 2 or less, then the distribution doesn’t have a defined variance. So in this case, with an exponent around 1.6, the distribution of income in India has a well-defined mean but no well-defined variance.
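To see where those thresholds come from, here is a short derivation. It assumes the simplest power law, a Pareto distribution with some minimum income x_m, and takes "exponent" to mean the exponent α of the tail (the share of people earning more than x), which is how it was measured above. The symbols x_m and α are my notation.

```latex
\begin{align}
P(X > x) &= \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m \\
E[X] &= \int_{x_m}^{\infty} x \cdot \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}} \, dx
      = \frac{\alpha x_m}{\alpha - 1} \qquad \text{(finite only if } \alpha > 1\text{)} \\
E[X^2] &= \int_{x_m}^{\infty} x^2 \cdot \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}} \, dx
      = \frac{\alpha x_m^2}{\alpha - 2} \qquad \text{(finite only if } \alpha > 2\text{)} \\
\operatorname{Var}(X) &= E[X^2] - E[X]^2
      = \frac{\alpha x_m^2}{(\alpha - 1)^2 (\alpha - 2)} \qquad (\alpha > 2)
\end{align}
```

With α around 1.6, the first integral converges (the mean is about 2.67 times x_m) but the second does not, so the variance is infinite.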
To recall, the central limit theorem states that the population mean follows a normal distribution with the mean centred at the sample mean, and a variance of $\sigma^2/n$, where $\sigma$ is the standard deviation of the underlying distribution and $n$ is the number of points surveyed. And when the underlying distribution itself is a power law distribution with an exponent less than 2 (as the case is in India), $\sigma$ itself is not defined.
Which means the distribution of population mean around sample mean has infinite variance. Which means the sample mean tells you absolutely nothing!
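A quick simulation gives a feel for this. It is a sketch under assumptions of my own: incomes are drawn from an idealised Pareto distribution with exponent 1.6 and a minimum income of 1 unit, each survey covers 1,000 people, and the survey is repeated 500 times. None of this is the actual tax data.

```python
import random
import statistics

ALPHA = 1.6                      # tail exponent estimated above
TRUE_MEAN = ALPHA / (ALPHA - 1)  # mean of a Pareto with minimum 1, about 2.67

def survey_means(sample_size, trials, seed=0):
    """Average income found by each of `trials` independent surveys."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # random.paretovariate draws from a Pareto with minimum value 1
        sample = [rng.paretovariate(ALPHA) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means = survey_means(sample_size=1000, trials=500)
print("true mean:", round(TRUE_MEAN, 2))
print("median survey estimate:", round(statistics.median(means), 2))
print("lowest and highest estimates:", round(min(means), 2), round(max(means), 2))
```

The estimates are lopsided: a survey that happens to include one Ambani-sized income lands far above the true mean, and the surveys that miss one sit below it.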
And hence, surveying is not a good way to find the average income of a population.