selection bias – Pertinent Observations

Everyone can be above average

All it requires is some selection bias

There were quite a few teachers during my time at IIT Madras who were rumoured to have said the line “I want everyone in class to be above average”. Some people credit a professor of mathematics for saying this. At other times, the quote is ascribed to a lecturer of Engineering Drawing. In the last 20 years I’m sure even some statistics professors would have been credited with this line.

The absurdity in the line is clear. By definition, everyone cannot be above average. The average is a measure of central tendency. However you define it (arithmetic mean, geometric mean, harmonic mean, median, mode), the average is by definition a “central value”, meaning you will have numbers both above and below it. In the worst case (assuming you are using a mode or median for a highly skewed distribution), there will be a large number of data points EQUAL to the average. Everyone cannot be above (strictly greater than) average.

However, based on some recent incidents, I figured out a way in which everyone can actually be above average. All it takes is some kind of selection bias. Basically you need to be clever in terms of how you count – both when you calculate the average and when you define the “everyone”.

Take one example – you have an exam you need to pass to go from Grade 1 to Grade 2. Let’s say the class average (let’s use the simple mean here) is 41, and you need to have scored at least 40 to pass. Let’s also assume that nobody has scored exactly 40 or 41.

Now, if you come back next month and look at the exam scores of all the Grade 2 students, you will find that all of them would have scored strictly more than 41 – the old “average”. In other words, since the below average students are no longer part of the sample (since they have “not passed”), everyone left is above average! The below average set has simply been eliminated!
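Here is a minimal sketch of this in Python. The scores are made up (an assumption for illustration), chosen so that the class mean is exactly 41 and nobody scored exactly 40 or 41.

```python
from statistics import mean

# Hypothetical Grade 1 exam scores, chosen so the mean is exactly 41
# and nobody scored exactly 40 or 41.
grade1_scores = [12, 25, 33, 38, 42, 45, 52, 58, 64]
PASS_MARK = 40

# The "average" is computed over everyone who wrote the exam.
class_average = mean(grade1_scores)

# Only those who passed show up in Grade 2 next month.
grade2_scores = [s for s in grade1_scores if s >= PASS_MARK]

# "Everyone" now means only the survivors, but the average is the old one.
everyone_above_average = all(s > class_average for s in grade2_scores)

print("Class average:", class_average)
print("Grade 2 scores:", grade2_scores)
print("Everyone above the old average?", everyone_above_average)
print("Average recomputed over Grade 2 only:", mean(grade2_scores))
```

The trick lies entirely in the mismatch between the two populations: the average is taken over the full class, while "everyone" is counted only among those who passed. Recompute the average over Grade 2 alone and some students fall below it again.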

Another way is simple relative grading. Let’s say there are 3 sections in the class. Telling one section that “everyone should be above average” is fairly legit – all it says is that this particular section should outperform the others so significantly that everyone in this section will be above the average defined by all sections!
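A small sketch of the relative grading case, again in Python. The three sections and their scores are hypothetical, picked only to show one section sitting entirely above the all-sections average.

```python
from statistics import mean

# Hypothetical scores for three sections of the same class.
sections = {
    "A": [80, 85, 90],
    "B": [40, 50, 60],
    "C": [30, 45, 55],
}

# The average is defined over all sections put together.
all_scores = [s for scores in sections.values() for s in scores]
overall_average = mean(all_scores)

# Check, section by section, whether every student beats that average.
all_above = {
    name: all(s > overall_average for s in scores)
    for name, scores in sections.items()
}

print("Overall average:", round(overall_average, 2))
print("Every student above average, by section:", all_above)
```

Here "everyone" is one section, and the average comes from a larger group that includes the other two.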

It is easier to do in code – using some statistical packages, as long as you slip in a few missing values into your dataset, you will find that the average is meaningless, and when you ask your software for how many are above average, the program defaults can mean that everyone can be classified as “above average” (even the ones with missing values).

I must have recommended this a few times already, but Darrell Huff’s 1954 book How to Lie With Statistics remains a masterpiece.

It’s not just about status

Rob Henderson writes that in general, relative to the value they add to their firms, senior employees are underpaid and junior employees are overpaid. This, he reasons, is because senior employees trade off money for status.

Quoting him in full:

Robert Frank suggests the reason for this is that workers would generally prefer to occupy higher-ranked positions in their work groups than lower-ranked ones. They’re forgoing more earnings to hold a higher-status position in their organization.

But this preference for a higher-status position can be satisfied within any given organization.

After all, 50 percent of the positions in any firm must always be in the bottom half.

So the only way some workers can enjoy the pleasure inherent in positions of high status is if others are willing to bear the dissatisfactions associated with low status.

The solution, then, is to pay the low-status workers a bit more than they are worth to get them to stay. The high-status workers, in contrast, accept lower pay for the benefit of their lofty positions.

I’m not sure I agree. Yes, I do agree that higher productivity employees are underpaid and lower productivity employees are overpaid. However, I don’t think status fully explains it. There are also issues of variance and correlation and liquidity (there – I’m talking like a real quant now).

On the variance front – the higher you are in the organisation and the higher your salary is, the greater the variance of your contribution to the organisation. For example, if you are being paid $350,000 (the number Henderson hypothetically uses), the actual value you are bringing to your firm might have a mean of $500,000 and a standard deviation of $200,000 (pulling all these numbers out of thin air, while making some sense checks that broadly risk pricing holds).

On the other hand, if you are being paid $35,000, then it is far more likely that the average value you bring to the firm is $40,000 with a standard deviation of $5,000 (again numbers entirely pulled out of thin air). Notice the drastic difference in the coefficient of variation in the two cases.
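To make the comparison concrete, the coefficient of variation is simply the standard deviation divided by the mean. A quick Python sketch using the same thin-air numbers as above (the function name is just an illustrative choice):

```python
def coefficient_of_variation(mean_value, std_dev):
    """Standard deviation expressed as a fraction of the mean."""
    return std_dev / mean_value

# Senior employee: mean contribution $500,000, standard deviation $200,000.
senior_cv = coefficient_of_variation(500_000, 200_000)

# Junior employee: mean contribution $40,000, standard deviation $5,000.
junior_cv = coefficient_of_variation(40_000, 5_000)

print(f"Senior CV: {senior_cv:.3f}")
print(f"Junior CV: {junior_cv:.3f}")
print(f"Ratio: {senior_cv / junior_cv:.1f}x")
```

With these numbers the senior employee's coefficient of variation is 0.4 against 0.125 for the junior employee, a little over three times as large.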

Putting it another way, the more productive you are, the harder it is for any organisation to put a precise value on your contribution. Henderson might say “you are worth 500K while you earn 350K” but the former is an average number. It is because of the high variance in your “worth” that you are paid far lower than what you are worth on average.

And why does this variance exist? It’s due to correlation.

More so at higher ranked positions (as an aside – my weird career path means that I’ve NEVER been in middle management) the value you can add to a company is tightly coupled with your interactions with your colleagues and peers. As a junior employee your role can be defined well enough that your contributions are stable irrespective of how you work with the others. At senior levels though a very large part of the value you can add is tied to how you work with others and leverage their work in your contributions.

So one way a company can get you to contribute more is to have a good set of peers you like working with, which increases your average contribution to the firm. Rather paradoxically, because you like your peers (assuming peer liking in senior management is two way), the company can get away with paying you a little less than your average worth and you will continue to stick on. If you don’t like working with your colleagues, there is the double whammy that you will add less to the company and you need to be paid more to stick on. And so if you look at people who are actually successful in their jobs at a senior level, they will all appear to be underpaid relative to their peers.

And finally there is liquidity (can I ever theorise about something without bringing this up?). The more senior you go, the less liquid is the market for your job. The number of potential jobs that you want to do, and which might want you, is very very low. And as I’ve explained in the first chapter of my book, when a market is illiquid, the bid-ask spread can be rather high. This means that even holding the value of your contribution to a company constant, there can be a large variation in what you are actually paid. And that is again why, on average, senior employees are underpaid.

So yes, there is an element of status. But there are also considerations of variance, correlation and bid-ask. And selection bias (senior employees who are overpaid relative to the value they add don’t last very long in their jobs). And this is why, on average, you can afford to underpay senior employees.

Indian Americans and the Selection Bias

There is this one chart from the Economist that has been doing its rounds over the interwebs over the last few days:

Basically it shows that Indian Americans are much more accomplished academically and professionally compared to other immigrants. And there are many theories floating around as to why Indians are so successful.

The answer, however, is rather simple – selection bias. Migrating from India to the US was an extremely difficult task till the 1960s – there were some quotas that the US had for immigration under which the Indians had nothing. And when Indians did finally start migrating in the 1960s, it was mostly for education.

Most people who migrated from India to the US even in the 1960s and 70s did so to go to graduate school. And this meant that they already had 16 years of education in India, which either meant an engineering or medical degree, or a masters in one of the other fields. So basically most Indians migrating to the US were highly accomplished already.

And considering the kind of foreign exchange controls imposed by the Indian government, the only Indians who could afford to go to the US for an education were those that received a fellowship or support from their universities, thus increasing the selection bias! (Now that I’ve mentioned foreign exchange controls, you should listen to this song, which was apparently meant to parody such policies)

Yes, you had the odd Patel without much education who made it to open a “Potel” (Patel run Motel), but that is probably the reason that the Indian bubble in the above chart is not farther out!

So that Indians have done better than other migrating communities in the US is not about innate Indian intelligence, or innate Indian ability to work hard, or because the Americans took in the Indians much better than other nationalities. It is simple selection bias, based on tight immigration controls and tight emigration controls and stupid foreign exchange policy on the part of the Indian government (which, at one point of time, only allowed citizens to take out eight dollars from the country).

To illustrate this point, look at the country that is “second” (quotes since we are looking at two dimensions here, so second is subjective) in this list – Iran.

Useless LinkedIn

I’m not a big fan of LinkedIn. I mean, I use it, and fairly regularly at that (check it at least once a day), and I think conceptually it’s quite useful. However, in practice, I think there are a number of sticking points about the service, which makes it quite useless.

For starters its apps (iPad and Android) are quite lousy, and offer nowhere close to the kind of experience that the web interface offers. Things are extremely unintuitive (down to the tabbing order – you compose message, hit tab and enter, and you don’t send the message. It takes you to the profile of the person you’re messaging instead) on the website. Sometimes the apps show notifications even after you’ve checked them on the web, and so on.

In other words it’s an extremely poorly engineered product, but which is surviving (and thriving) thanks to network effects!

I might have commented on this in the past but there is this thing on endorsements. This was something that coincided with the time when LinkedIn went public (if I’m not wrong), and you could endorse people for their “skills” on LinkedIn. For a while I played along with the game. But then I completely lost it when a distant uncle who I’m sure has never traded derivatives endorsed me for “derivatives”. I quickly deleted my skills.

Then there are the LinkedIn recommendations, which have inherent selection bias and hence add no value. And then you have the “say congrats” feature, where LinkedIn prompts you to “say congrats” on people changing jobs or hitting job anniversaries. I’ve found this mildly useful (dropping a note when someone switches jobs is a good way to stay in touch), but there are bugs in terms of job downgrades and people getting fired.

And of late, there has been serious spam in terms of people’s status updates. I don’t know when it became popular to post silly puzzles on professional networking sites, yet I find several of them popping up on my timeline every day, and the number of people who have shared each is not funny. Then you have these cartoons (Dilbert and the copycats), and “guru quotes” that appear in the form of images that further spam your timelines! The only way I can think of these being useful is that they act as a negative indicator when you’re checking out the profile of someone you are looking to hire or do business with!

To summarise, LinkedIn seems to be an extremely badly engineered product on several counts, but thanks to network effects (so many people are already on it that entry barriers for competitors are really high) the site still manages to do well! I wonder what it will take to disrupt it. Facebook for business is not the answer for sure – the potential havoc caused by a breach in Chinese walls there will scare people enough to not sign up.

What do you think? Here is their stock price movement for reference:

Selection bias and recommendation systems

Yesterday I was watching a video on youtube, and at the end of it it recommended another (the “top recommendation” at that point in time). This video floored me – it was a superb rendition of Endaro Mahaanubhaavulu by Mandolin U Shrinivas. Listen and enjoy as you read the rest of the post.

I was immediately bowled over by youtube’s recommendation system. I had searched for both Shrinivas and Endaro … in the not-so-distant past so Youtube had put two and two together and served me up an awesome rendition! I was so happy that I went to ~~town~~ twitter about it.

Google must have its algos right, for Youtube recommended this excellent rendition of Endaro.. to me https://t.co/HstqNo8nMm by U Shrinivas

— karthik (@karthiks) January 26, 2015

It was then that I realised that this was the first time ever that I had noticed the top recommendation of Youtube. In other words, every time I use youtube, it recommends a video to me, but I seldom notice it. And I seldom notice it for a reason – the recommendations are usually irrelevant and crap. The one time I like the video it throws up, though, I feel really happy and go gaga over the algorithm!

In other words, there’s a bias whose exact name I don’t know – the bias that when an event happens in a certain direction, you tend to notice it and give credit where you think it’s due. And when it doesn’t happen that way, you simply ignore it!

In terms of larger implications, this is similar to how legends such as “lucky shirts” are born. When something spectacular happens, you notice everything that is associated with that spectacular event and give credit where you think it’s due (lucky shirt, lucky pen, etc.). But when things don’t go your way you think it’s despite the lucky shirt, not because the shirt has become unlucky.

It’s the same thing with belief in “god”. When you pray and something good happens to you after that, you believe that your prayers have been answered. However, when you pray and something good doesn’t happen, you ignore the fact that you prayed.

Coming back to recommendation systems such as Youtube’s, the problem is that it is impossible for a recommendation system to get recommendations right all the time. There will be times when you get it wrong. In fact, going by my personal experience with Youtube, Amazon, etc. most of the time you will get your recommendation wrong.

The key to building a recommendation system, thus, is to build it such that you maximise the chances of getting it right. Going one step further I can say that you should maximise the chances of getting it spectacularly right, in which case the customer will notice and give you credit for understanding her. Getting it “partly right” most of the time is not enough to catch the customer’s attention.

Putting marketing jargon on it, what you should focus on is delighting the customer some of the time rather than keeping her merely happy most of the time!
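A toy illustration of this argument in Python. Everything here is an assumption made up for the sketch: the two recommenders, their quality scores on a 0 to 1 scale, and the threshold above which a customer actually notices a recommendation.

```python
from statistics import mean

# Hypothetical: only recommendations at least this good get noticed.
NOTICE_THRESHOLD = 0.9

# Hypothetical quality scores (0 to 1) for ten recommendations each.
steady_recommender = [0.6] * 10             # partly right, every time
spiky_recommender = [0.1] * 8 + [0.95] * 2  # mostly wrong, occasionally superb

def noticed_hits(scores, threshold=NOTICE_THRESHOLD):
    """Count the recommendations good enough for the customer to notice."""
    return sum(1 for s in scores if s >= threshold)

for name, scores in [("steady", steady_recommender), ("spiky", spiky_recommender)]:
    print(name, "mean quality:", round(mean(scores), 2),
          "noticed hits:", noticed_hits(scores))
```

The steady recommender has the higher average quality but never crosses the threshold, so it never gets any credit. The spiky one is worse on average and still gets noticed twice, and since the misses are ignored, those two hits are all the customer remembers.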

Selection bias in Catalunya?

Catalunya, where I spent two weeks last month, votes today in an “informal referendum” on whether to secede from Spain. This vote is non-binding after the Spanish Supreme Court declared an earlier “official referendum” called by the Catalan government illegal. As I write this (11 pm IST; 6:30 pm in Catalunya), FT reports that there were “long lines to vote” in the informal referendum today. The same report in the opening paragraph mentions “with an overwhelming majority expected to back a proposal to break away from the rest of the country and form an independent state”.

Looking at it from a pure numbers perspective, this outcome is not unexpected. Consider two hypothetical voters and Barcelona residents Jordi and Jorge (the more observant reader might observe that these names have been chosen carefully) who are respectively for and against the secession. What are their incentives to come out and vote today, as against in a “real referendum”?

As far as Jorge is concerned, today’s vote doesn’t matter to him. Given that today’s referendum is “informal”, which way it goes has, in Jorge’s opinion, absolutely no impact on his life. Thus, he will consider the time and energy he would have to expend in queueing up and voting today to be not worth it. And thus he will not bother. And get on with his life. If today’s referendum were “real”, though, Jorge would have every incentive to register his opposition in the hope that his vote would help tip the vote towards a “no”, and thus he would be voting.

What about Jordi? Even though Jordi knows that today’s vote is only “informal”, he wants to send out a statement that he is in favour of secession. The way he sees it, the larger the majority by which today’s vote will come out in support of secession, the stronger the message that will be sent to Madrid, which he hopes should sooner or later be forced to relent, and permit a real independence vote. As far as Jordi is concerned, today’s vote has tremendous signalling value, and to this end he has every incentive to expend his time and energy and queue up and vote!

Based on this, more Jordis are likely to come out today to vote, while fewer Jorges are likely to do so. Which means that today’s vote, thanks to the selection bias of one side being much more disposed to vote than the other, is likely to throw up a skewed result! Thus, it makes sense to treat the results with some salt.
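The size of the skew is easy to work out. The sketch below uses entirely hypothetical numbers: an electorate split evenly between Jordis and Jorges, with assumed turnout rates for each side in an informal versus a real referendum.

```python
def observed_yes_share(yes_voters, no_voters, yes_turnout, no_turnout):
    """Share of 'Yes' among votes actually cast, given each side's turnout rate."""
    yes_votes = yes_voters * yes_turnout
    no_votes = no_voters * no_turnout
    return yes_votes / (yes_votes + no_votes)

# Hypothetical electorate: 500 Jordis (for) and 500 Jorges (against).
JORDIS, JORGES = 500, 500

# Informal vote (assumed): Jordis turn out to make a statement, Jorges mostly stay home.
informal = observed_yes_share(JORDIS, JORGES, yes_turnout=0.6, no_turnout=0.1)

# Real vote (assumed): both sides have every incentive to turn out.
real = observed_yes_share(JORDIS, JORGES, yes_turnout=0.7, no_turnout=0.7)

print(f"True support for Yes:       {JORDIS / (JORDIS + JORGES):.1%}")
print(f"Yes share, informal vote:   {informal:.1%}")
print(f"Yes share, real referendum: {real:.1%}")
```

With these made-up turnout rates, an electorate that is split down the middle produces roughly an 86 percent "Yes" in the informal vote, purely because of who bothers to show up.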

But what about higher order effects? It is not hard for Jordis and Jorges to see what I’ve written above. Knowledge of this is not likely to change Jordi’s stance – just “victory” in today’s referendum is not enough for him. He is using today’s vote to primarily make a statement and the larger the “majority” that can be shown in favour of a “Yes” vote, the better it is for him. So the second order effects will not affect Jordi.

What about Jorge? He understands that his vote doesn’t really matter since today’s referendum is not real, but he also knows that most people in favour of secession are likely to be voting today. Thus, the referendum is going to show an inflated majority in favour of “Yes”. So should he vote today to balance things out? On the one hand the effort might be worth it in terms of reducing the majority for the “Yes” vote. But then again, Jorge will realise that the selection bias in today’s vote is very very apparent, and his effort in marginally reducing the majority in favour of “Yes” may not actually be worth it! And so he will not vote.

So it is clear that today’s vote will show a significant majority in favour of secession from Spain, but that this is likely to be vastly overstated and very different from what things would be like had today’s vote been real. In that sense, if Spanish Prime Minister Mariano Rajoy and his advisers are smart, and realise the selection bias that is inherent, they can render today’s “informal referendum” rather pointless.

Booze and volatility

Another of those things I’ve been intending to write for a really long time. Occasionally when I’m not feeling too good mentally, people ask me to go have a drink telling me that everything will be alright. However, given my limited experience in this I’m not too confident it will work. In fact, the one time I tried drowning my sorrows in alcohol (this was over four years ago) I ended up feeling significantly worse, worse enough to have not tried it since.

The thing with booze is that it increases the volatility of your state of mind. This means that it will flatten out the curve (the probability distribution) according to which your mental state moves. So after you’ve had a drink or a few, you are unlikely to remain in the same state that you started off in. You end up feeling either significantly better or significantly worse – and the chances of both these go up tremendously when you drink.

I know that so far I have been acting based on one data point that went adversely, but I don’t know what causes the selection bias in people who have been through both sides – feeling much worse and feeling much better after having some drinks. Why is it that even though all of them would’ve been through significantly worse after drinking at some point of time or the other, they tend to forget about it and only think of the times when they’ve felt better?

Is it that whether you feel good or not is some kind of a binary payoff depending upon the level of the state of mind (basically state of mind < cutoff implies “bad”; state of mind >= cutoff implies “good”)? If this is true, then whenever you are “out of the money” (feeling bad), you don’t really care if you go even more out of the money – your overall feeling doesn’t change by much. And so you don’t really mind the cases when the alcohol starts making you feel significantly worse. But then the barrier is ahead of you so by increasing volatility, you are giving yourself a better chance of surmounting the barrier so drinking makes sense! But then under this condition it doesn’t make sense to drink at all when you’re already feeling good!
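This binary payoff argument can be sketched numerically. The model below is an assumption made for illustration, not something established above: the state of mind after drinking is taken to be normally distributed around the current state, and booze is assumed only to raise the standard deviation (the volatility) without shifting the mean. The cutoff and the volatility values are arbitrary.

```python
from statistics import NormalDist

def prob_feeling_good(current_state, cutoff, volatility):
    """P(final state >= cutoff), assuming a normal distribution centred on the current state."""
    return 1 - NormalDist(mu=current_state, sigma=volatility).cdf(cutoff)

CUTOFF = 0.0     # hypothetical barrier between "bad" and "good"
SOBER_VOL = 0.5  # assumed volatility without a drink
DRUNK_VOL = 2.0  # assumed (higher) volatility after a drink or a few

results = {}
for label, state in [("feeling bad", -1.0), ("feeling good", 1.0)]:
    sober = prob_feeling_good(state, CUTOFF, SOBER_VOL)
    drunk = prob_feeling_good(state, CUTOFF, DRUNK_VOL)
    results[label] = (sober, drunk)
    print(f"{label}: P(good) sober = {sober:.3f}, after drinking = {drunk:.3f}")
```

Under these assumptions, someone starting below the cutoff improves their chances of ending up "good" by drinking, while someone starting above it reduces them, which is the option-like payoff described above.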

Are there any other reasons you can think of for this selection bias? Why do people give more weight to positive movements in state of mind as a function of drinking than to negative movements in state of mind? Or is it that volatility is a non-intuitive concept and “there’s a better chance you’ll feel better if you drink” is a simple way of communicating it? And let me know your experience about drink making you feel worse.