## Surveying Income

For a long time now, I’ve been sceptical of the practice of finding out the average income in a country or state or city or locality by doing a random survey. The argument I’ve made is “whether you keep Mukesh Ambani in the sample or not makes a huge difference in your estimate”. So far, though, I hadn’t been able to make a proper mathematical argument.

In the course of writing a piece for Bloomberg Quint (my first for that publication), I figured out a precise mathematical argument. Basically, incomes are distributed according to a power law distribution, and the exponent of the power law means that variance is not defined. And hence the Central Limit Theorem isn’t applicable.

OK let me explain that in English. The reason sample surveys work is due to a result known as the Central Limit Theorem. This states that for a distribution with finite mean and variance, the average of a sufficiently large random sample of data points is not very far from the average of the population, and the difference approximately follows a normal distribution with zero mean and variance that is inversely proportional to the number of points surveyed.

So if you want to find out the average height of the population of adults in an area, you can simply take a random sample, find out their heights and you can estimate the distribution of the average height of people in that area. It is similar with voting intention – as long as the sample of people you survey is random (and without bias), the average of their voting intention can tell you with high confidence the voting intention of the population.
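To see this in action for something like height, here is a minimal simulation sketch. It assumes heights are normally distributed with a mean of 165 cm and a standard deviation of 10 cm (made-up numbers, purely for illustration), runs many surveys of size `n`, and compares the spread of the survey averages with the predicted $\sigma/\sqrt{n}$.

```python
import random
import statistics

def sample_mean_spread(n, trials=1000, mu=165.0, sigma=10.0, seed=0):
    """Standard deviation of the sample mean across `trials` surveys of size n.

    mu and sigma are illustrative assumptions, not real height data.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

for n in (25, 100, 400):
    observed = sample_mean_spread(n)
    predicted = 10.0 / n ** 0.5  # sigma / sqrt(n)
    print(f"n={n:4d}  observed spread={observed:.3f}  predicted={predicted:.3f}")
```

Quadrupling the sample size should roughly halve the spread of the survey averages, which is what makes surveys of well-behaved quantities reliable.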

This, however, doesn’t work for income. Based on data from the Indian Income Tax department, I could confirm (what theory states) that income in India follows a power law distribution. As I wrote in my piece:

The basic feature of a power law distribution is that it is self-similar – where a part of the distribution looks like the entire distribution.

Based on the income tax returns data, the number of taxpayers earning more than Rs 50 lakh is 40 times the number of taxpayers earning over Rs 5 crore.
The ratio of the number of people earning more than Rs 1 crore to the number of people earning over Rs 10 crore is 38.
About 36 times as many people earn more than Rs 5 crore as do people earning more than Rs 50 crore.

In other words, if you increase the income limit by a factor of 10, the number of people who earn over that limit falls by a factor between 35 and 40. This translates to a power law exponent between 1.55 and 1.6 (log 35 to base 10 and log 40 to base 10 respectively).
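The arithmetic behind that exponent: if the number of people earning more than $x$ is proportional to $x^{-\alpha}$, then raising the limit tenfold cuts the count by a factor of $10^{\alpha}$, so $\alpha$ is the base-10 logarithm of the ratio. A small sketch (the function name `tail_exponent` is mine) that applies this to the ratios quoted above:

```python
import math

def tail_exponent(ratio, factor=10):
    """Exponent alpha such that raising the limit by `factor` cuts the count by `ratio`."""
    return math.log(ratio) / math.log(factor)

# Ratios quoted from the income tax returns data above.
ratios = {
    "50 lakh vs 5 crore": 40,
    "1 crore vs 10 crore": 38,
    "5 crore vs 50 crore": 36,
}
for label, ratio in ratios.items():
    print(f"{label}: alpha = {tail_exponent(ratio):.2f}")
```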

Now power laws have a quirk – their mean and variance are not always defined. If the exponent of the power law is less than 1, the mean is not defined. If the exponent is less than 2, then the distribution doesn’t have a defined variance. So in this case, with an exponent around 1.6, the distribution of income in India has a well-defined mean but no well-defined variance.
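For reference, the standard moment formulas for a Pareto distribution make these thresholds explicit. This is a sketch that assumes incomes follow an exact Pareto law with minimum income $x_m$ and tail exponent $\alpha$ (symbols introduced here for illustration), where the exponent describes the fraction of people earning more than $x$:

$$
P(X > x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad
E[X] = \frac{\alpha\, x_m}{\alpha - 1} \quad (\alpha > 1), \qquad
\mathrm{Var}(X) = \frac{\alpha\, x_m^2}{(\alpha - 1)^2 (\alpha - 2)} \quad (\alpha > 2).
$$

For $\alpha \le 1$ the mean is infinite, and for $\alpha \le 2$ the variance is infinite. With $\alpha \approx 1.6$, only the mean is finite.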

To recall, the central limit theorem states that the population mean follows a normal distribution with the mean centred at the sample mean, and a variance of $\frac{\sigma^2}{n}$ where $\sigma$ is the standard deviation of the underlying distribution and $n$ is the number of points surveyed. And when the underlying distribution itself is a power law distribution with an exponent less than 2 (as is the case in India), $\sigma$ itself is not defined.

Which means the distribution of population mean around sample mean has infinite variance. Which means the sample mean tells you absolutely nothing!

And hence, surveying is not a good way to find the average income of a population.
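A rough way to get a feel for this is to simulate it. The sketch below assumes incomes follow an exact Pareto distribution with exponent 1.6 and a minimum income of 1 unit (a simplification, since real incomes only follow a power law in the tail), and runs many independent surveys of the same size using Python's built-in `random.paretovariate`.

```python
import random
import statistics

ALPHA = 1.6  # tail exponent estimated from the tax data above

def survey_means(n, trials, alpha=ALPHA, seed=42):
    """Sample means from `trials` independent surveys of `n` Pareto incomes each."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.paretovariate(alpha) for _ in range(n))
        for _ in range(trials)
    ]

# Mean of a Pareto distribution with minimum 1 (valid only for alpha > 1).
true_mean = ALPHA / (ALPHA - 1)

means = sorted(survey_means(n=1000, trials=200))
print(f"population mean : {true_mean:.2f}")
print(f"smallest survey : {means[0]:.2f}")
print(f"median survey   : {statistics.median(means):.2f}")
print(f"largest survey  : {means[-1]:.2f}")
```

The exact numbers depend on the seed, but the survey averages tend to scatter widely, with the occasional survey thrown far off by a single Ambani-like draw.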

## Inequality in income and consumption

My hypothesis is that while inequality in terms of income or wealth (measured in rupees/dollars) has been growing, consumption inequality is actually coming down. I hope to do a more detailed analysis using data, but I’ll stick to an anecdote for this introductory blogpost.

The trigger for this thought came about a year back, at a meeting in one of the organisations I’m associated with. The meeting wasn’t terribly interesting, so I spent time checking out the guy sitting next to me, whose Net Worth I knew is at least a couple of orders of magnitude more than mine.

He was wearing a Louis Philippe shirt, and I have several shirts of that brand. He had a Parker pen, and I use a Parker too. He had a rather fancy watch whose brand I do not recall now, but my Seiko isn’t that bad in comparison. And he had an iPhone, which cost four times as much as the phone I used then (a Moto G), but not out of reach for me.

I can go on but the gist is that while our income and wealth levels were different by an order of magnitude, our consumption wasn’t all that far off. I must admit that I’m also a so-called “1-percenter” in terms of income (recall a study which said that 99th percentile of income in India is Rs. 12 lakh per annum), so I was also part of the power law tail, yet the marginal difference in consumption to income levels was strikingly low.

Since this is an introductory blog post on this topic, I posit that this is a more general trend and applies at many other levels. The thing with inequality is that income (and wealth) is usually distributed according to a Power Law (unless the state is extremely coercive and extractive), so as the economy grows, inequality as measured by measures such as the Gini coefficient is bound to increase (here’s a nice but hard-to-read paper by Nassim Nicholas Taleb on why the Gini coefficient is flawed for fat-tailed distributions such as the power law).

Yet, as the economy grows, more people are pushed beyond a “basic level” of income where they are able to afford “necessities” (and certain kinds of luxuries), so inequality as measured by consumption will actually be lower. The challenge is in measuring such inequality appropriately.

I’ll mention a couple more anecdotes in support of this. One sector where inequality has fallen is commuting. Some rich old-time Bangaloreans look back in nostalgia at a time when there was no congestion on the streets of Bangalore, and how the city has since deteriorated. Yet, that congestion-free travel was then available only to the extremely wealthy (who could afford private vehicles) or lucky (my father waited for four years to get his first scooter because of limited supply). Public transport infrastructure was abysmal and buses infrequent.

Now, a larger proportion of the population can afford private vehicles and public transport has also improved (though not by much), making life better at the lower end of income/wealth levels. And the rich (who had exclusive access to roads in private cars earlier) are faced with higher congestion.

Another obvious example is the telephone. Very few people had them even twenty years back (we applied for ours in 1989, only to get “allotted” a phone in 1995), and now pretty much everyone has a basic mobile phone (and with cheaper smartphones, even some relatively poor people own one).

This is a theory worth pursuing. Need to analyse how to collect data and measure inequality, but I think there’s something to this hypothesis. Any thoughts will be welcome!

## Measuring Income

Earlier today I was reading an interview in the Business Standard with Shaibal Gupta, Secretary of the Asian Development Research Institute and member of the Raghuram Rajan committee on composite development index of states.  Gupta wrote a dissenting note to the report, with his main contention being the use of the Median Per Capita Expenditure (MPCE) as a measure of income to compare states rather than using the Per Capita Gross State Domestic Product (GSDP). I must state up front that I agree with the report here, and will use this post to defend my stance. Meanwhile, I must mention that one of the reasons he gave for using the GSDP (“Per capita income is taken as an indicator for this purpose by a number of institutions, including the Planning Commission and Finance Commissions.”, he said) almost made me fall off my chair.

Suppose you run a manufacturing company. Your production facility is located in Hosur, Tamil Nadu. However, for administrative convenience, and for the convenience of your top management, you have decided to headquarter your company in Bangalore, Karnataka (for the record, Hosur is just about 35 kms from Bangalore). Most of your workers live in Tamil Nadu, and draw their salaries there. Your top management gets compensated in Karnataka, and they live there. The question is how your company contributes to the economies of the two states.

From an accounting perspective, all your sales are attributed to Karnataka, for you are headquartered there. Of course, what your workers in Tamil Nadu spend out of their salaries will be accounted for in that State’s GDP but the overall sales of the company itself will be attributed to Karnataka, even though the company does next to no economic activity there. With the simple act of locating your company headquarters in Karnataka, you push up Karnataka’s GSDP while reducing Tamil Nadu’s. Some states (eg. Maharashtra and Delhi) are much more popular than others for the location of company headquarters, and they can lead to a fairly distorted figure of how much is produced in each state.

That is not all. The problem with Per Capita GSDP is that it is a mean figure, and is thus liable to be grossly affected by extreme values. Let us say we are comparing the income levels in two neighbourhoods. Neighbourhood A has 1000 people. 999 of them earn Rs. 1000 per month while the 1000th earns Rs. 1 crore per month. Neighbourhood B also has 1000 people but each of them earns Rs. 10000. Which neighbourhood is richer?

If you go by the mean income, the mean income of A is Rs. (999 * 1000 + 1 * 1,00,00,000)/1000 = Rs. 10999. The mean income of B is Rs. 10000. So you would say that A is richer than B. While on an average that might be true, you might notice that the number for A is skewed by the one rich guy. What this hides is the fact that 99.9% of A earn only a tenth of B’s mean income. Can we do better?

Instead of looking at what the resident of a neighbourhood makes on an average, what if we instead measure what the average person in the neighbourhood makes? In other words, what if we measure the median income in each neighbourhood? The advantage with the median is that it doesn’t get skewed by extreme values, as is likely in case of a variable such as income which usually follows a power law distribution. In our example above, the median income of A is Rs. 1000 while that of B is Rs. 10,000 which is probably a better reflection of the richness of the average resident of these two localities.
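The comparison is easy to check with Python's `statistics` module. This sketch uses the figures from the calculation above (999 residents of A at Rs. 1000 and one at Rs. 1 crore, everyone in B at Rs. 10,000); the variable names are mine.

```python
import statistics

# Neighbourhood A: 999 residents at Rs. 1000 and one at Rs. 1 crore (1,00,00,000).
incomes_a = [1_000] * 999 + [10_000_000]
# Neighbourhood B: 1000 residents at Rs. 10,000 each.
incomes_b = [10_000] * 1000

for name, incomes in (("A", incomes_a), ("B", incomes_b)):
    mean = statistics.mean(incomes)
    median = statistics.median(incomes)
    print(f"Neighbourhood {name}: mean = Rs. {mean}, median = Rs. {median}")
```

The single rich resident moves A's mean past B's, while A's median stays at what the other 999 residents earn.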

Similarly, the per capita GSDP, being a mean measure, is not a great measure for determining the richness or poorness of the people of a state. Suppose, for example, that neighbourhoods A and B are two states. Notice that A will have a larger per capita GSDP than B (about 10% higher, going by the means computed above), yet this tells us nothing about the richness of the average resident of these two states.

Putting both the above reasons together, you realize that the per capita GSDP is not a great estimator of the richness of a particular state.

So what do we use? We discussed above that median income is a much better metric than the mean income. So can we use that for measuring richness instead? While it sounds good in theory, we have a practical and accounting problem – given that a large part of the country is essentially a cash economy, it is hard to keep track of people’s incomes. Moreover, there are enough reasons to both under-report and over-report one’s income if you were to ask someone as part of a survey. For this reason, the general consensus among development economists is that total consumption expenditure is a good estimate of income among the poor, whose net savings rate is negligible.

What about the non-poor, you may ask. Notice that we are only trying to capture the expenditure of the median resident of a state, and assuming that more than 50% of a state is within an income level at which income equals consumption expenditure is fair. So the median per capita expenditure will give a good picture.

So how do we estimate this? Unfortunately, we don’t have any accounting statistics that capture this, and we need to rely on surveys. The National Sample Survey Organization (NSSO) conducts surveys on people’s consumption expenditure every five years, and this is what the Rajan committee has used. Now, you may question the wisdom of relying on sample data (rather than “population data”) to determine the richness or poorness.

The answer to that is that the median is a rather robust statistic, and as long as samples have been chosen at random, it is unlikely that the median of a sample will be too far away from the median of the population (and this is independent of the distribution of the population). We will examine the issue of sampling median in a subsequent post.
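As a quick illustration of that robustness, here is a simulation sketch. It assumes, purely for illustration, that the quantity being surveyed follows a Pareto distribution with exponent 1.6 and minimum 1 (a fat-tailed case in which the sample mean is unreliable), and compares the median of several independent surveys with the known population median $2^{1/\alpha}$.

```python
import random
import statistics

ALPHA = 1.6  # assumed tail exponent, chosen for illustration

def sample_median(n, alpha=ALPHA, seed=0):
    """Median of one simulated survey of `n` Pareto-distributed values."""
    rng = random.Random(seed)
    return statistics.median(rng.paretovariate(alpha) for _ in range(n))

# Median of a Pareto distribution with minimum 1.
population_median = 2 ** (1 / ALPHA)

for seed in range(5):
    print(f"survey {seed}: sample median = {sample_median(10_000, seed=seed):.3f}")
print(f"population median = {population_median:.3f}")
```

Each survey's median should land close to the population median, even though the underlying distribution has no defined variance.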

In conclusion, I endorse the decision of the Rajan committee to use the median per capita consumption expenditure as a metric for determining the richness or poorness of a state.

## Importance of candidate’s caste in voting

Not-for-profit Daksh recently conducted a massive survey in Karnataka which tried to understand voter preferences, evaluate MLA performance, etc. This was a comprehensive survey covering over 12000 voters across all districts in Karnataka. Apart from capturing demographic information, the survey asked questions on what voters look for in a candidate and what issues they think are important for an MLA.

One of the questions asked was the importance of a candidate’s caste when it comes to voting. Voters were asked to indicate if it was “very important”, “important” or “not important”. For the purpose of my analysis I’ve given a score of 1 for “very important”, 0.5 for “important” and 0 for “not important”. The relationship between a voter’s annual family income and their perception of the importance of caste is extremely interesting, as this graph indicates.
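For clarity, the scoring described above amounts to averaging the scores within each group of voters. A minimal sketch, with a made-up handful of responses (not taken from the Daksh data) and a hypothetical helper name:

```python
from statistics import fmean

# Scores as described above.
SCORES = {"very important": 1.0, "important": 0.5, "not important": 0.0}

def caste_importance(responses):
    """Average score for a group of survey responses."""
    return fmean(SCORES[r] for r in responses)

# Hypothetical responses for one income group, for illustration only.
example = ["very important", "important", "not important", "not important"]
print(caste_importance(example))
```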