On dealing with good and bad news

So I was having a rough time an hour or so back and called the wife, and told her so, and that I’d been stuck in this vicious circle of negativity for a while now. In response, she said there was this nice TED talk that she had seen on the topic recently, and I should watch it too. And so she sent me this:

It’s a nice TED talk, but I think the speaker uses too many words to say what she needs to say. It’s just not “quick enough”, and it doesn’t need ten minutes to communicate what she has said here. (Check out the wife’s blog post on distractions caused by professors being too slow in class. She only talked about the throughput of words there, but I think it goes deeper and extends to the throughput of information content.)

And so I thought about how I could convey the message better. I realised that the entire talk above could be condensed into one little finite automaton. And then I drew it (using the Paintbrush App on my Mac).

Goodandbadnews

I must say I’m feeling much better already! Tell me if this is a good representation, though!

Marrying out of caste – 1

This is the first in what is going to hopefully be a long series of posts on inter-caste marriages. As you might have figured out, I’ve stumbled upon a nice data set with lots of data on this topic (Hat tip: Nitin Pai and Rohit Pradhan), and there are some beautiful insights in the data.

The data is based on a National Family Health Survey which was conducted in 2005-06. The sample size of the survey itself was massive – close to a lakh respondents for the entire survey, and about 43,000 women who were surveyed on the inter-caste marriage question alone. The survey, which was carried out in all states in India, asked “ever-married” women whether they were married to someone from the same caste, or to someone from a higher caste, or to someone from a lower caste. Some demographic data was also collected, which leads to some interesting cross-tabs that we can explore either in this post or in a later one in the series.

If there is one single piece of information that can summarise the survey, it is that the national average for the percentage of women who are married to someone of their own caste is 89%, and this number doesn’t vary by much across demographics or region or any other socio-economic indicators.

Of course, there are differences, and some regional differences are vast. For example, 97% of women surveyed in Tamil Nadu were married to someone from the same caste, while the corresponding figure in Punjab is only 80%. Figure 1 here shows the distribution across states of the percentage of women married to men of the same caste.

intercaste1

 

Different colours here represent different regions of India. Since the data in the above graph has been sorted by value, the reasonably random distribution of colour (anyone notice a pattern anywhere?) shows that there is no real regional trend. But the inter-state differences in this graph are stark (80% to 97%). This raises questions about the homogeneity of castes, and about possibly differing definitions of caste in different states.

For example, some people might define caste as their “varna”, while some might go deeper into the family’s traditional occupation. Others might go deeper still – there is no end to the level you can reach in the caste hierarchy. Might it be possible that the stark regional differences can be explained by the varying definitions of caste?

Another interesting piece of data given is the percentage of women in each state married to men of a higher caste, and the percentage married to men of a lower caste. Now, in the interest of natural balance and matching, these two numbers ought to be equal (the paper expresses surprise that these two numbers are equal in most states – but there is no reason to be surprised). In fact, we can create an “imbalance index” for each state – the difference between the percentage of women married to men of a higher caste and the percentage of women married to men of a lower caste.

A positive index indicates that women in the state prefer to “marry up” (to men of a higher caste) rather than “marry down”. It also indicates that in the absence of inter-state “trade” of marital partners, there will be large numbers of unmarried men of the lowest caste and women of the highest caste in that state! A negative index implies an excess of single men of the highest caste and single women of the lowest caste (both these calculations assume, of course, that the sex ratio is the same across castes). The second figure here plots this index across states. The colouring scheme is the same.
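For anyone who wants to compute the index themselves, here is a minimal sketch in Python. The state names and percentages below are made up purely for illustration (they are not figures from the survey), and the function name `imbalance_index` is my own.

```python
# Hypothetical percentages, for illustration only (not figures from the survey).
# Each entry: state -> (% of women married to a higher-caste man,
#                       % of women married to a lower-caste man)
marriages = {
    "State A": (7.0, 3.0),
    "State B": (4.0, 9.0),
    "State C": (5.0, 5.0),
}


def imbalance_index(pct_higher, pct_lower):
    """Percentage marrying up minus percentage marrying down."""
    return pct_higher - pct_lower


for state, (up, down) in marriages.items():
    index = imbalance_index(up, down)
    if index > 0:
        reading = "more women marry up than down"
    elif index < 0:
        reading = "more women marry down than up"
    else:
        reading = "balanced"
    print(f"{state}: index = {index:+.1f} ({reading})")
```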

intercaste2

This shows that there are states with massive imbalances – Maharashtra, for example, will end up having a large number of single men of the lowest caste and single women of the highest caste unless they get “cleared” in “trade” with other states. Kerala has the opposite problem. It is interesting to notice that Punjab, which has the highest percentage of inter-caste marriages, also has a reasonably balanced market.

So should we explore if there exists a relationship between the proportion of women married to men of the same caste and how balanced the marriage market is in the state with respect to caste? The hypothesis, based on the example of Punjab in the above two graphs, is that the greater the incidence of inter-caste marriage in a state, the smaller the imbalance in terms of caste in the market. Let’s do a scatter plot which includes the above two bar plots and see for ourselves:

intercaste3

On the X axis we have the percentage of women married to men of the same caste. On the Y axis, we have the absolute value of the imbalance index (in other words, we don’t care which way it is imbalanced, we only want to know how imbalanced the caste dynamics in marriage are in each state). The blue line is the line of best fit. Notice that it slopes downward. In other words, the greater the share of same-caste marriages, the smaller the imbalance between women marrying above and below their own caste, which is interesting. Notice that Punjab sits all alone as an outlier at the bottom left of the above graph! Kerala is an outlier at the top left corner!

Now you might posit that if fewer people are available for inter-caste marriage, the difference between those “marrying up” and those “marrying down” is bound to be lower, since the sum is lower. However, if we normalise the index for each state by the proportion of inter-caste marriages in that state, the above graph will still look pretty much the same!
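Here is one way that normalisation could be done, as a rough sketch. I am assuming that the proportion of inter-caste marriages is simply the sum of the “married up” and “married down” percentages; the function name and the numbers in the example are hypothetical.

```python
def normalised_imbalance(pct_higher, pct_lower):
    """Absolute imbalance as a share of all inter-caste marriages.

    Assumes inter-caste marriages = pct_higher + pct_lower.
    """
    inter_caste = pct_higher + pct_lower
    if inter_caste == 0:
        return 0.0  # no inter-caste marriages, so nothing to be imbalanced
    return abs(pct_higher - pct_lower) / inter_caste


# Two made-up states: the first has more inter-caste marriage and a larger
# raw gap (4 points vs 2), but the smaller gap relative to its total.
print(normalised_imbalance(12.0, 8.0))  # 4 / 20
print(normalised_imbalance(3.0, 1.0))   # 2 / 4
```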

Caste and marriage are more complicated than we think!

Bonuses and federalism

I spent a couple of years working for an investment bank, and the way they would distribute (the rather hefty) bonuses in the organization was rather interesting. Each manager in the firm would receive two sums – the first was his own bonus, and the second was the bonus to be distributed among all his subordinates. If any of the said subordinates were managers themselves, they would similarly receive two sums – separately for themselves and for their subordinates.

This is pertinent in relation to the devolution of power between the states and the third level of government. Even though district, taluk and city governments have been empowered by the 73rd and 74th amendments, they don’t have much real power because their finances are controlled by their respective state governments. In banking terms, this is like giving a manager one pot, and asking him to divide it between himself and his subordinates. The incentive is obviously to distribute the minimum amount possible to keep the subordinates happy. And this is exactly what is happening to federalism in India today.

What we need is a strict rule-based formula for distributing central government revenues between the central government, the states and the next level (the rule can be based on population, etc.). What we also need is a requirement for states to enact similar rules to divide revenue between the state, districts and sub-districts. Until this happens, true federalism will remain a pipe dream.

Wheat and rice production revisited

Towards the end of last month we had looked at states in India with the maximum land under wheat and rice cultivation (both on an absolute and a relative basis). We revisit that topic of rice and wheat cultivation here, except that now we look at production (in kg) and productivity (kg per hectare). The data at data.gov.in spans from 1998 to 2010, but data for all states is not available for 2009 and 2010, so assuming that production patterns don’t change drastically, I’ve used data from 2008 to look at the biggest producers of these commodities.

Four figures offered without further comment.

1. Top wheat growing states in India (as of 2008)

wheat1

2. Productivity growth in major wheat growing states of India

wheat2

3. Top rice growing states of India (as of 2008)

rice3

4. Productivity growth in major rice growing states of India

rice2

Measuring Income

Earlier today I was reading an interview in the Business Standard with Shaibal Gupta, Secretary of the Asian Development Research Institute and member of the Raghuram Rajan committee on composite development index of states. Gupta wrote a dissenting note to the report, with his main contention being the use of the Median Per Capita Expenditure (MPCE) as a measure of income to compare states rather than using the Per Capita Gross State Domestic Product (GSDP). I must state up front that I agree with the report here, and will use this post to defend my stance. Meanwhile, I must mention that one of the reasons he gave for using the GSDP (“Per capita income is taken as an indicator for this purpose by a number of institutions, including the Planning Commission and Finance Commissions,” he said) almost made me fall off my chair.

Suppose you run a manufacturing company. Your production facility is located in Hosur, Tamil Nadu. However, for administrative convenience, and for the convenience of your top management, you have decided to headquarter your company in Bangalore, Karnataka (for the record, Hosur is just about 35 kms from Bangalore). Most of your workers live in Tamil Nadu, and draw their salaries there. Your top management gets compensated in Karnataka, and they live there. The question is how your company contributes to the economies of the two states.

From an accounting perspective, all your sales are attributed to Karnataka, for you are headquartered there. Of course, what your workers in Tamil Nadu spend out of their salaries will be accounted for in that State’s GDP but the overall sales of the company itself will be attributed to Karnataka, even though the company does next to no economic activity there. With the simple act of locating your company headquarters in Karnataka, you push up Karnataka’s GSDP while reducing Tamil Nadu’s. Some states (e.g. Maharashtra and Delhi) are much more popular than others as locations for company headquarters, and this can lead to a fairly distorted picture of how much is produced in each state.

That is not all. The problem with Per Capita GSDP is that it is a mean figure, and is thus liable to be grossly affected by extreme values. Let us say we are comparing the income levels in two neighbourhoods. Neighbourhood A has 1000 people. 999 of them earn Rs. 1000 per month while the 1000th earns Rs. 1 crore per month. Neighbourhood B also has 1000 people but each of them earns Rs. 10000. Which neighbourhood is richer?

If you go by the mean income, the mean income of A is Rs. (999 * 1000 + 1 * 1,00,00,000)/1000 = Rs. 10999. The mean income of B is Rs. 10000. So you would say that A is richer than B. While on average that might be true, you might notice that the number for A is skewed by the one rich guy. What this hides is the fact that 99.9% of A’s residents earn only a tenth of B’s mean income. Can we do better?

Instead of looking at what the resident of a neighbourhood makes on an average, what if we instead measure what the average person in the neighbourhood makes? In other words, what if we measure the median income in each neighbourhood? The advantage with the median is that it doesn’t get skewed by extreme values, as is likely in case of a variable such as income which usually follows a power law distribution. In our example above, the median income of A is Rs. 1000 while that of B is Rs. 10,000 which is probably a better reflection of the richness of the average resident of these two localities.
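The neighbourhood example is easy to check with a few lines of Python, using only the standard library. The variable names are mine; the incomes are the ones from the example (999 people at Rs. 1000 and one at Rs. 1 crore in A, everyone at Rs. 10000 in B).

```python
from statistics import mean, median

CRORE = 1_00_00_000  # Rs. 1 crore = 10,000,000

# Neighbourhood A: 999 people earn Rs. 1000, one person earns Rs. 1 crore.
neighbourhood_a = [1000] * 999 + [CRORE]
# Neighbourhood B: all 1000 people earn Rs. 10000.
neighbourhood_b = [10000] * 1000

for name, incomes in (("A", neighbourhood_a), ("B", neighbourhood_b)):
    print(f"{name}: mean = {mean(incomes)}, median = {median(incomes)}")
```

The mean puts A ahead of B, while the median puts B ahead of A by a factor of ten.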

Similarly, the per capita GSDP, being a mean measure, is not a great measure for determining the richness or poorness of the people of a state. Suppose, for example, that neighbourhoods A and B are two states. Notice that A will have a larger per capita GSDP than B (Rs. 10999 against Rs. 10000), and that this tells us nothing about the richness of the average resident of these two states.

Putting both the above reasons together, you realize that the per capita GSDP is not a great estimator of the richness of a particular state.

So what do we use? We discussed above that median income is a much better metric than the mean income. So can we use that for measuring richness instead? While it sounds good in theory, we have a practical and accounting problem – given that a large part of the country is essentially a cash economy, it is hard to keep track of people’s incomes. Moreover, there are enough reasons to both under-report and over-report one’s income if you were to ask someone as part of a survey. For this reason, the general consensus among development economists is that total consumption expenditure is a good estimate of income among the poor, whose net savings rate is negligible.

What about the non-poor, you may ask. Notice that we are only trying to capture the expenditure of the median resident of a state. It is fair to assume that more than 50% of a state’s residents are at an income level at which income roughly equals consumption expenditure, so the median per capita expenditure will give a good picture.

So how do we estimate this? Unfortunately, we don’t have any accounting statistics that capture this, and we need to rely on surveys. The National Sample Survey Organization (NSSO) conducts surveys on people’s consumption expenditure every five years, and this is what the Rajan committee has used. Now, you may question the wisdom of relying on sample data (rather than “population data”) to determine the richness or poorness.

The answer to that is that the median is a rather robust statistic, and as long as samples have been chosen at random, it is unlikely that the median of a sample will be too far away from the median of the population (and this is independent of the distribution of the population). We will examine the issue of sampling median in a subsequent post.
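Until that post, here is a quick simulation to give a feel for the claim. Everything in it is an assumption made for illustration: the “population” is 100,000 synthetic incomes drawn from a heavy-tailed Pareto distribution, the sample size of 2,000 is arbitrary, and none of it is NSSO data.

```python
import random
from statistics import median

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 incomes with a heavy right tail.
population = [1000 * rng.paretovariate(1.5) for _ in range(100_000)]

# A simple random sample of 2,000 people, drawn without replacement.
sample = rng.sample(population, 2_000)

population_median = median(population)
sample_median = median(sample)
relative_gap = abs(sample_median - population_median) / population_median

print(f"population median: {population_median:.0f}")
print(f"sample median:     {sample_median:.0f}")
print(f"relative gap:      {relative_gap:.2%}")
```

Changing the seed or the shape parameter of the distribution is a simple way to see how little the gap moves.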

In conclusion, I endorse the decision of the Rajan committee to use the median per capita consumption expenditure as a metric for determining the richness or poorness of a state.

Largest crop by state

We will continue to stick with the state-wise data on agriculture. In this edition, we will look at the largest crop by state, by year. We define this as the crop with the biggest acreage in the state.

No fancy visualizations here. Just data presented in a table. Two tables, actually, one for kharif and one for rabi. For each year these two tables show the biggest crop per state.

Offered without comment.

Major Crops in India, by State: Kharif Season

 

Major Crops in India: Rabi Season


 

The Raghuram Rajan Committee report on Composite Development Index of States

In July this year, at a resort near Bangalore (yes, we at Takshashila do sometimes play resort politics), I got the fifth batch of the GCPP to work on the problem of building an index that measures the development of various Indian states over the last 10 years. I used this case as a reference in my module on Analytical Methods in Public Policy, taught at one of the weekend workshops that are part of the GCPP. As part of this exercise I taught them how to pick variables, measure them, procure data, look for interactions between variables and then combine them to form an index.

It is interesting that a couple of months after that session, the report of the Raghuram Rajan Committee on Composite Development Index of States has been published. I will use this blog post to give my comments on that report as I go through it. Since I’m going to be effectively “live-blogging” my reading of the report, the rest of this post is in bullet points.

Also, in keeping with my title of “resident quant”, I will try as much as possible to restrict my comments to the data and methodology, and not comment on economic issues. However, it is likely that I might go on economic rants here or there.

  • The first paragraph of the executive summary states that the reason we adopted a command and control model after independence was so that we don’t increase the inequalities across regions and states. This is the first time I’m hearing this story
  • The index is based entirely on publicly available data. I think this is a good thing.
  • Each state gets 0.3% of the total available pool, irrespective of its size. Of the remaining 91.6% (28 states => 8.4% fixed payment), 3/4th will be distributed based on “need” and 1/4 on “performance”. Nearly seventy years since independence, I’m of the opinion that this ratio should be less skewed towards “need”
  • Arbitrary cutoffs have been drawn at scores of 0.6 and 0.4 to classify states as “least developed” and “less developed”. While these are round numbers, I’m not yet sure they make sense.
  • The report alludes to the “resource curse”, which is a good thing.
  • Quote: “The Normal Central Assistance (NCA) grant, which is distributed to states as per Gadgil-Mukherjee formula based on categorization of “Special Category” and “General Category” states, constituted only about 3.8 per cent of total resources transferred to States and 8.2 per cent of plan transfers.” (emphasis mine)
  • The underdevelopment index has ten components. I won’t comment on the wisdom of the number or quality of the components chosen.
  • It is a good thing that Mean Per Capita Consumption Expenditure is used as a measure of richness rather than per capita Net State Domestic Product. As the report argues, the latter can include economic activity that doesn’t really reach the people, and is hence not as good a measure as consumption expenditure.
  • Table 1 (on page 17 of the report) gives the correlations between the metrics chosen. I think it is a fantastic thing that they have chosen to present the correlations in the first place (something ripe to be pushed under the carpet). As expected, a number of chosen variables are highly correlated.
  • Correlation between Consumption Expenditure and Urbanisation is 75%!! Similarly, correlation between expenditure and female literacy is 58%.
  • Then comes the damp squib – the excitement induced by presenting the correlation table is doused by the statement that each of the ten parameters is going to be accorded equal weight. This is disappointing on several counts: firstly, the sheer arbitrariness (remember that ‘equal allocation’ is as arbitrary as any other distribution). Next, that the correlations are thrown out of the window and certain factors are likely to get more weight. Then, the fact that this makes it easily manipulable by adding or deleting factors of choice. I’m so disappointed by this one decision that I’m putting this entire point in boldface. Apologies.
  • The report acknowledges that broadly categorizing states into “developed” and “under developed” creates issues of moral hazard. However, rather than fully doing away with the division, the committee (again, disappointingly) takes a “middle path” by splitting two categories into three. I suspect some mathematical brain is involved here, in that the next committee will increase the number of categories to four, and the one after to five, until a time when each state (finally!!) becomes its own category
  • To convert per capita allocation to state-wide allocation, the formula uses a combination of population and area. I agree that it is tougher to provide infrastructure to thinly populated  areas, so this combination is fair. It reminds me of my days in airline cargo pricing when we would similarly adjust between the weight and volume of a piece of cargo.
  • Performance index is computed based on changes in the development index over time. This is a good thing. Shows the committee is “eating its own dog-food”
  • This is the first time “performance” is being used as a criterion for fund distribution. So the 25% weight is a good start. I retract my earlier abuse of this ratio.
  • The committee recommends that this analysis be carried out every five years, since a good amount of data used in calculating indices are published at that frequency. Also considering that’s the frequency of finance commissions, it is a good thing.
  • The report tries to bolster its credibility by showing that the index is highly correlated with the UN Human Development Index. I like it that a scatter plot and regression line have been presented
  • The allocation based on performance is again skewed in favour of less developed states. So you are likely to get more if 1. You are underdeveloped and 2. You have shown an improvement. I think this is fair.
  • One good thing is that the formula is plug and play. It is “timeless” in a sense. At any given future point in time, you can simply look up the data points that are required and just construct the index. There is no human intelligence required for that effort
  • There is heavy reliance on NSSO data, and I’m not sure that’s a good thing since it is “survey data”. I think it might have been better to have used data from census.
  • The committee actually examined the option of weighting factors based on squared factor loadings from a Principal Component Analysis (*applause*) and found that the index thus constituted was 99% correlated with the one using simple arithmetic averages, and thus decided to go with the simpler formula. I’ll still continue to keep the earlier point in bold, though
  • Each “sub-component” was normalized between 0 and 1 using a simple linear formula (higher number indicating greater under-development). I like it that they used this rather than a rank ordering metric.
  • The report includes a sensitivity analysis to show that the ranking and index values are robust. Again, applause
  • A dissent note from Committee member Shaibal Gupta indicates that there are problems in using a simple weighted average rather than data from the PCA
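To make the normalisation and equal-weighting steps concrete, here is a minimal sketch. The two components, the three states and all the numbers are hypothetical, and the `higher_is_worse` flag is my own device for making sure a higher score always indicates greater under-development; the report’s exact formula may differ.

```python
def normalise(values, higher_is_worse=True):
    """Linearly rescale values to [0, 1], with 1 = most under-developed."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no variation across states
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_worse else [1.0 - s for s in scaled]


# Hypothetical data for three states (not taken from the report).
components = {
    "poverty_rate": ([10.0, 20.0, 30.0], True),      # higher = worse
    "female_literacy": ([90.0, 70.0, 50.0], False),  # higher = better
}

normalised = [normalise(vals, flag) for vals, flag in components.values()]

# Equal weights: the index is the simple average of the sub-components.
index = [sum(state_scores) / len(normalised) for state_scores in zip(*normalised)]
print(index)
```

Note how adding a third component that is highly correlated with one of the first two would quietly give that underlying factor more weight.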

Finally, despite all the talk of transparency and ease of calculation, the report itself does not contain either the index number or the component values for various states. I hope the data has been released (and if it has, please help me by giving me the link). If not, we should campaign for the data to be given out to the public in a CSV (or equivalent) format through the government data portal http://data.gov.in

 

Difficulty of Indian Education Boards

With the IITs now requiring that students score in the top 20 percent of their respective boards in order to qualify for admission, we have a chance to evaluate the relative difficulty of various Indian boards.

The IIT Delhi website has the cutoffs for each board. These cutoffs represent the “80th percentile scores” for each board, i.e. if you were to rank all students who took that particular board exam, these are the marks scored by the student who is 80% of the way up from the bottom. If you have written any of these board exams and got more than the corresponding 80%ile score for your board, you are eligible to join IIT (provided you score sufficiently in the JEE-main and JEE-advanced, of course).
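If you had the full list of marks for a board, the cutoff could be computed along these lines. This is only a sketch using the simple “nearest rank” definition of a percentile; I don’t know which definition the boards or the JEE council actually use, and the marks below are made up.

```python
def percentile_cutoff(marks, pct):
    """Nearest-rank percentile: the mark at or below which pct% of students fall."""
    ordered = sorted(marks)
    # Ceiling of pct * n / 100, done in integer arithmetic.
    rank = -(-pct * len(ordered) // 100)
    rank = max(rank, 1)  # guard against pct = 0
    return ordered[rank - 1]


# Hypothetical board with 100 students scoring 1, 2, ..., 100.
marks = list(range(1, 101))
cutoff = percentile_cutoff(marks, 80)
print(cutoff)
```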

This plot shows the cutoffs (80th %ile score) for various boards:

Source: http://jee.iitd.ac.in/percentile2013.pdf

Note that the four southern states are on top. These states are reputed to have high educational attainment. Could this be a consequence of easier board exams in these states? We don’t know.

Also, interestingly, these four states are followed by ISC and CBSE, ahead of the other state boards. The cutoff for ISC is higher than that for CBSE, which flies in the face of the conventional wisdom that CBSE is “easier” than ISC.

Also, if you look at the data, some states have more than one board, and the JEE council has used separate cutoffs for each of these boards. For the purpose of my analysis I’ve arbitrarily chosen one board for each state – typically the one whose total is the “roundest” number.

 

Value of skill in rural India

Earlier today I blogged about wage rates for unskilled workers in rural India. Now, we will use the same dataset to see what premium people pay for skills. The same data gives wages for certain occupations – carpenters, masons, cobblers, blacksmiths, etc. There are also wages given for various types of farm labour, and for the purpose of this exercise I’ve taken ploughing as representative of farm labour.

The following plot shows the wage rates for different skills in different states. A note on how to read this graph. The x axis represents the state and the y axis represents the daily wage for that particular skill. The skill itself is represented in text form. So for example a carpenter in rural Kerala gets about Rs. 600 per day while a sweeper in Bihar gets about Rs. 100.

Source: Labour Bureau. Numbers for April 2013
  • Notice that even skilled jobs in other states don’t fetch as much as an unskilled job in Kerala. Tamil Nadu and Punjab come closest
  • The skills most in demand in rural areas across states are carpentry and masonry, if you go by this data
  • In most states, cobblers earn less than “unskilled workers”. This is interesting because there is skill involved in making and repairing shoes. The low wages for cobblers could indicate a caste bias. It is also possible that since cobblers are mostly self-employed their wage rate is inaccurate
  • Blacksmiths are again not too highly valued in villages
  • The high numbers for Kerala could be a function of the state’s lower urban-rural divide compared to the rest of India. Kerala is generally described as a semi-urban continuum with no strongly delineated urban and rural areas. Rural workers could be expensive since they are in demand for urban jobs also, unlike in other states.

 

 

The same caveats that apply to the previous post apply to this. We don’t know the sample size or the accuracy of the survey. Nevertheless, some interesting insights come out.

Wage rates in rural India

The Labour Bureau, affiliated to the Union Ministry of Labour, does a monthly survey on wages in Rural India. Wages of men and women in select occupations are polled (data is collected by the NSSO) and published on the website of the labour bureau. In this post we will look at the average daily wages of unskilled male workers (as reported by this survey) in the 20 states for which it is published (your guess is as good as mine as to why it is not published for other states).

Source: Labour Bureau statistics

It is interesting to note that the daily wage of the average unskilled man in Kerala is almost five times that of the average unskilled man in either Gujarat or Madhya Pradesh (states that are at the bottom of the list). Some states known to be “progressive” such as Punjab, Haryana and Tamil Nadu are also towards the top of the list while other so-called “progressive” states such as Maharashtra, Karnataka and Gujarat are close to the bottom.

Like any other data put out by the government, this should be taken with a pinch of salt. First of all, the sample size is not mentioned. Secondly, only the average number is reported and no measure of dispersion is given. For example, it is hypothetically possible that in Kerala they interviewed ten workers, nine of whom received Rs. 100 and the tenth received Rs. 4000, leading to an average of Rs. 490! As a rule of thumb, when you put out survey data, you should always include the sample size and a measure of variation (such as the standard deviation), else it is hard to conclude anything from the data.
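To see how much a bare average can hide, here is the hypothetical Kerala example from above worked out in Python. The ten wages are the made-up ones from the previous paragraph, not survey data.

```python
from statistics import mean, median, stdev

# Hypothetical sample: nine workers at Rs. 100 and one at Rs. 4000.
wages = [100] * 9 + [4000]

sample_size = len(wages)
average = mean(wages)
middle = median(wages)
spread = stdev(wages)  # sample standard deviation

print(f"n = {sample_size}, mean = {average}, median = {middle}, "
      f"standard deviation = {spread:.0f}")
```

A standard deviation larger than the mean itself, on a sample of ten, is exactly the kind of warning sign that reporting only the average conceals.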