A/B testing with large samples

Over at Numbers Rule Your World, Kaiser Fung has a nice commentary on Andrew Gelman’s analysis of the Facebook controversy (where Facebook apparently “played with people’s emotions” by manipulating their news feeds). The money quote from Fung’s piece:

Sadly, this type of thing happens in A/B testing a lot. On a website, it seems as if there is an inexhaustible supply of experimental units. If the test has not “reached” significance, most analysts just keep it running. This is silly in many ways but the key issue is that if you need that many samples to reach significance, it is guaranteed that the measured effect size is tiny, which also means that the business impact is tiny.

This is a common fallacy that I’ve often referred to on this blog and in my writing elsewhere: when you have really large sample sizes, even tiny differences in measured values can be statistically significant. Statistical significance, however, does not imply business impact – sometimes the effect is so small that the only significance it has is statistical!
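To make this concrete, here’s a minimal sketch in Python. The numbers are made up for illustration – 10.0% versus 10.1% conversion, with a million users in each arm – and the two_proportion_z_test helper is my own, not from any library. The point is that with samples this large, a lift of a tenth of a percentage point comes out “significant” at the usual 5% level.

```python
import math

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical A/B test: 10.0% vs 10.1% conversion, one million users per arm.
n = 1_000_000
z, p = two_proportion_z_test(int(0.100 * n), n, int(0.101 * n), n)
print(f"lift = 0.1 percentage points, z = {z:.2f}, p = {p:.4f}")
```

The p-value here comes out around 0.02 – comfortably below 0.05 – even though a 0.1 percentage point lift may be worth essentially nothing to the business.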

So before you blindly make business decisions based on statistical significance, check whether the measured difference is large enough to actually matter to your business. It may not strictly be “noise” – the significance test has shown it isn’t – but it can still be an effect that, for all business purposes, is safely ignored.
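In code, that decision rule might look something like the sketch below. The ship_it name and the one-percentage-point threshold are hypothetical – the right threshold depends entirely on your business, not on the statistics.

```python
# Ship a change only if the effect is both statistically significant
# and large enough to matter commercially. The threshold is made up.
MIN_MEANINGFUL_LIFT = 0.01  # e.g. we only care about >= 1 percentage point

def ship_it(lift, p_value, alpha=0.05):
    """Require statistical AND practical significance."""
    return p_value < alpha and abs(lift) >= MIN_MEANINGFUL_LIFT

print(ship_it(lift=0.001, p_value=0.019))  # False: significant but ignorable
print(ship_it(lift=0.020, p_value=0.019))  # True: significant and meaningful
```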

PS: Fung and Gelman are my two favourite bloggers when it comes to statistics and quant. A lot of what I’ve learnt on this subject is down to these two gentlemen. If you’re interested in statistics, quant and visualisation, I recommend subscribing to both of Fung’s feeds and to Gelman’s feed.