Over the last four years, one of my most frequent commutes in Bangalore has been between Jayanagar and Rajajinagar – I travel between these two places once a week on average. There are several routes one can take to get to Rajajinagar from Jayanagar, and one of them goes through the interior of Chamrajpet. However, I can count on the fingers of one hand the number of times I’ve taken that route in the last four years. This is because the first time I took that route I got stuck in a massive traffic jam.

Welcome to the world of real distributions and observed distributions. The basic concept is that if you observe a particular event rarely, the distribution you observe can be very different from the actual distribution. Take the example above of driving through inner Chamrajpet. Let us say that the average time to drive through that particular road on a Saturday evening is 10 minutes. Let us say that 99% of the time on a Saturday evening, you take less than 15 minutes to drive through that road. In the remaining 1% of the time, you can take as much as an hour to drive through the road.

Now, if you are a regular commuter who drives through this road every Saturday evening, you will be aware of the distribution. You will be aware that 99% of the time you will take less than 15 minutes to get past, and you will base your routing decision on that. When it takes an hour to drive past, you know that it is a rare event and discount it from your future calculations. If, however, you are an irregular commuter like me and happened to drive through that road on the one day when it took an hour to get past, you will assume that that is the average time it takes to get past! You are likely to mistake the rare event for the usual one, and that can lead to suboptimal decisions in the future.

In his book The Black Swan, Nassim Nicholas Taleb talks about the inability of people to model for rare events. He says that the problem is that people underestimate the probability of rare events and fail to account for them in their models, leading to blow-ups when they do occur. While I agree that is a problem, I contend that the opposite problem cannot be ignored either. Sometimes you fail to recognize that what has happened to you is a rare event and thus end up with a wrong model.

Let me illustrate both problems with the same example. Think of a game where 99 times out of 100 you win a rupee. The rest of the time (i.e. 1%) you lose fifty rupees. Regular players of the game, who have “sampled” it enough, will know the full distribution, and will take that into account when deciding whether to play the game. Non-regular players, however, don’t have complete information.

Let us say there are a hundred cards. 99 of them have a +1 written on them, and the 100th has a -50. Let us suppose you pick ten cards. Ninety percent of the time, all ten cards you pick will be a “+1”, and you will conclude that all cards are “+1”. You will model for the game to give you a rupee each time you play. The other 10% of the time, however, you will draw nine +1s and one -50. You will then assume that the expected value of playing the game is Rs. -4.1 ((9 * 1 + 1 * (-50))/10). Notice that both times you are wrong in your inference: the true expected value, (99 * 1 + 1 * (-50))/100, is Rs. 0.49!
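For anyone who wants to check the arithmetic in the card example, here is a minimal Python sketch. It assumes the ten cards are drawn without replacement, which is what makes the 90% figure exact. The variable names are my own and purely illustrative.

```python
from fractions import Fraction
from math import comb

# Deck from the example: 99 cards worth +1 and one card worth -50.
DECK_SIZE = 100
SAMPLE_SIZE = 10
WIN, LOSS = 1, -50

# True expected value of one play, computed from the whole deck.
true_ev = Fraction(99 * WIN + 1 * LOSS, DECK_SIZE)

# Assumption: the ten cards are drawn without replacement.
# Probability that the single -50 card is NOT among the ten drawn.
p_all_wins = Fraction(comb(DECK_SIZE - 1, SAMPLE_SIZE),
                      comb(DECK_SIZE, SAMPLE_SIZE))

# Per-play value an occasional player would infer in each case.
estimate_all_wins = Fraction(SAMPLE_SIZE * WIN, SAMPLE_SIZE)
estimate_with_loss = Fraction((SAMPLE_SIZE - 1) * WIN + LOSS, SAMPLE_SIZE)

# Average of the two estimates, weighted by how often each case occurs.
average_estimate = (p_all_wins * estimate_all_wins
                    + (1 - p_all_wins) * estimate_with_loss)

print("True expected value per play:", float(true_ev))
print("Chance of seeing only +1 cards:", float(p_all_wins))
print("Estimate when all ten are +1:", float(estimate_all_wins))
print("Estimate when the -50 shows up:", float(estimate_with_loss))
print("Average estimate over all samples:", float(average_estimate))
```

The two estimates a ten-card player can end up with are Rs. 1 and Rs. -4.1, while the true value is Rs. 0.49. Averaged over all possible ten-card samples the estimate does come out to Rs. 0.49, but no single rare observer ever sees that number.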

So while it is important that you recognize black swans, it is also important that you don’t overestimate their probability. Always remember that if you are a rare observer, the distribution you observe may not reflect the real distribution.

Isn’t this simply about ensuring that your sample size is large enough and unbiased before making any observations?

It is, but it is a common mistake. And sometimes it is not feasible to have a “large enough” sample size.