Why I DON'T Care About Statistical Significance

You know the world has come a long way when someone has to espouse the heresy of not caring about statistical significance.

This is not an argument against A/B testing, but rather one about how we use A/B test results to make business decisions. Instead of statistical significance, let’s make decisions based on expected value, i.e. $benefit × probability − $cost.

A little background on statistical significance, or “p < 0.05”. Say you have just deployed an A/B test, comparing the existing red (control) vs. a new green (test) “BUY NOW!” button. Two weeks later you see that the green-button variant is making $0.02 more per visitor than the red-button variant. You run some stats, see that the p-value is less than 0.05, and are ready to declare the results “significant.” “Significant” here means that there’s a more than 95% chance that the color made a difference, or, more true to the statistics, that there’s less than a 5% chance of seeing a $0.02 or larger difference if the color actually made no difference.

That last sentence there is probably too long to fit in anyone’s attention span. Let me break it down a little. The problem here is that you need to prove, or disprove, that the difference between the two variants is real — “real” meaning generalizable to the larger audience outside of the test traffic. The philosophy of science (confirmation is indefinite while negation is absolute — a thousand white swans can’t prove that all swans are white, but one black swan can disprove that all swans are white) and practicality both require that people set out to prove that the difference is real by disproving the logical opposite, i.e. there is no real difference. Statistics allows us to figure out that if we assume there is no difference between the red- and green-button variants, the probability of observing a $0.02 or larger difference by random chance is less than 0.05, i.e. p < 0.05. That is pretty unlikely. So we accept the alternative assumption, that the difference is real.
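
To make this concrete, here is a minimal sketch of how such a p-value might be computed, assuming per-visitor revenue was logged for each variant. The data below is synthetic, and Welch’s t-test (via scipy) is just one reasonable choice of test, not necessarily what your analytics stack uses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-visitor revenue for two weeks of traffic (illustrative numbers only).
red_revenue = rng.exponential(scale=1.00, size=50_000)    # control: red button
green_revenue = rng.exponential(scale=1.02, size=50_000)  # test: green button

# Welch's t-test: how surprising is the observed difference in mean revenue
# per visitor if we assume there is no real difference between the buttons?
t_stat, p_value = stats.ttest_ind(green_revenue, red_revenue, equal_var=False)

print(f"observed lift per visitor: ${green_revenue.mean() - red_revenue.mean():.4f}")
print(f"p-value: {p_value:.3f}")
```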

What if you have a p-value of 0.4, i.e. a 40% chance of getting a $0.02 or larger difference simply by random fluctuations? Well, you may be asked to leave the test running for longer until it reaches “significance,” which may never happen if the two variants are really not that different, or you may be told to scrap the test.

Is that really the best decision for a business? If we start out with the alternative assumption that there is some difference between the variants, 60% of the time we will make more money with the test variant and 40% of the time we will lose money compared to the control. The net gain in extra-money-making probability is 20%. The expected size of the gain is $0.02 per visitor. Say we have 100K visitors each day; that’s $0.02 × 100,000 × 0.2 = $400 in expected extra revenue each day. It doesn’t cost me anything extra to implement the green button instead of the red one. Of course I should go for the green button.
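
As a sketch in code, the decision rule being argued for here is just the $benefit × probability − $cost formula from above; expected_daily_gain is a made-up helper name, and the (1 − 2p) term is the “net gain in extra-money-making probability” from this example:

```python
def expected_daily_gain(lift_per_visitor, visitors_per_day, p_value, daily_cost=0.0):
    """Expected extra revenue per day under the article's heuristic:
    net gain probability = (1 - p) - p = 1 - 2p, then benefit x probability - cost."""
    net_gain_probability = 1 - 2 * p_value
    return lift_per_visitor * visitors_per_day * net_gain_probability - daily_cost

# Green vs. red button: $0.02 lift, 100K visitors/day, p = 0.4, no extra cost.
print(f"${expected_daily_gain(0.02, 100_000, 0.4):,.0f} per day")  # ~$400 per day
```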

If we go back to the option of letting the test run for longer before making a decision, the upside is that we will have a more accurate estimate of the impact of the test variant. The downside is that, if one variant has $400 expected extra revenue each day, that’s $400 × (1 − traffic_in_test_variant%) extra dollars we are not taking in each day.

Now suppose you are so diligent that you keep rolling out A/B tests, this time testing a fancy new search ranking algorithm. Two weeks later you see a $0.10 increase in dollars spent per visitor for the test variant compared to the control (i.e. the existing search ranking algorithm). If the increase is real, with 100K visitors each day, that’s $0.10 × 100,000 = $10,000 in extra revenue each day. Now, let’s add a twist: you need five extra servers to support that fancy algorithm in production, and the servers cost $10,000 each to buy, plus another $10,000 per year to run. You want to make sure it’s worth the investment. Your stats tell you that you currently have a p-value of 0.3, which most people would interpret as a “nonsignificant” result. But a p-value of 0.3 means that with the new ranking algorithm the net gain in extra-money-making probability is 0.7 − 0.3 = 0.4. With the expected size of the gain being $0.10 per visitor, the expected extra revenue per year is $0.10 × 100,000 × 0.4 × 365 = $1.46M, which dwarfs the cost of the servers. The rational thing to do is of course to release it.
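
The same back-of-the-envelope calculation with the ranking-algorithm numbers, reading the running cost as $10,000 per server per year (so roughly $100K of server cost in the first year):

```python
# Fancy search ranking: $0.10 lift per visitor, 100K visitors/day, p = 0.3.
lift, visitors, p = 0.10, 100_000, 0.3
annual_benefit = lift * visitors * (1 - 2 * p) * 365   # the $1.46M expected gain
annual_server_cost = 5 * (10_000 + 10_000)             # purchase + first year of running

print(f"expected first-year net: ${annual_benefit - annual_server_cost:,.0f}")  # ~$1.36M
```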

Now, the $0.10 increase is the expected amount of increase; there is risk associated with it. In addition, humans are not rational decision makers, so a better framework is expected utility, which folds risk aversion into the calculation, but that is beyond the scope of this article. This article is about using statistical significance vs. expected value for making decisions.

Statistical significance is that magical point on the probability curve beyond which we accept a difference as real and beneath which we treat the difference as negligible. The problem is, as the above examples have demonstrated, probabilities fall on a continuous curve. Even if you do have a statistically significant result, a significance level of p = 0.05 means that 1 in 20 A/B comparisons where there is no real difference will still give you a statistically significant result simply by random chance. If you have 20 test variants in the same test, by chance alone roughly one of these variants will produce “statistically significant” results (unless you adjust the significance level for the number of variants).
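
A quick simulation illustrates the many-variants point. Every variant below is drawn from the same distribution as the control, so any “significant” result is a false positive (synthetic data, t-tests for simplicity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_variants, n_visitors = 1_000, 20, 2_000

false_positives = 0
for _ in range(n_experiments):
    control = rng.normal(loc=1.0, scale=1.0, size=n_visitors)
    for _ in range(n_variants):
        # The variant comes from the same population as the control: no real difference.
        variant = rng.normal(loc=1.0, scale=1.0, size=n_visitors)
        if stats.ttest_ind(variant, control).pvalue < 0.05:
            false_positives += 1

# With a 0.05 cutoff we expect about one false positive per 20-variant test.
print(f"'significant' variants per 20-variant test: {false_positives / n_experiments:.2f}")
```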

The normal distribution (or whatever distribution you use to get the probabilities) does not come with a marker of statistical significance, much like the earth does not come with latitudinal or longitudinal lines. Those lines are added essentially arbitrarily to help you navigate, but they are not the essence of the thing you are dealing with.

The essence of the thing you are dealing with in A/B tests is probability. So let’s go back to the basics and make use of probabilities. Talk about benefit and probability and cost, not statistical significance. It’s no more than a line in the sand.


Notes:

1. The above examples assumed that the A/B tests per se were sound and that the observed differences were stable. To estimate how much data you need before the observed difference stabilizes, use power analysis to calculate the required sample size.
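
As a sketch of that power analysis (using statsmodels; the standardized effect size of 0.02, e.g. a $0.02 lift against a roughly $1 per-visitor standard deviation, and the 80% power target are illustrative assumptions, not numbers from the article):

```python
from statsmodels.stats.power import TTestIndPower

# Standardized effect size (Cohen's d): assumed here to be a $0.02 lift
# against a ~$1.00 per-visitor standard deviation.
effect_size = 0.02

n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # significance level
    power=0.80,   # chance of detecting the effect if it is real
    ratio=1.0,    # equal traffic split between control and test
)
print(f"visitors needed per variant: {n_per_variant:,.0f}")
```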

2. Typical hypothesis testing procedure: to investigate whether an observed difference is generalizable outside of the test, we set up two competing hypotheses. The null hypothesis assumes that there is no difference between the two means, i.e. the two samples (e.g. two A/B test variants) are drawn from the same population, their means fall on the same sampling distribution. The alternative hypothesis assumes that the two samples are drawn from different populations, i.e. the means fall on two different sampling distributions. We start out assuming the null hypothesis to be true, and that the mean of the control variant represents the true mean of the population. We calculate the probability of getting the test variant mean under this assumption. If it’s less than some small number, for example p < 0.05, we reject the null hypothesis and accept the alternative hypothesis.
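
The same procedure as a rough sketch in code: assume the null hypothesis, use the standard error to describe the sampling distribution of the difference in means, and read off the probability of a difference at least as large as the one observed (a plain z-test on synthetic data; your test of choice may differ):

```python
import numpy as np
from scipy import stats

def two_sided_p_value(control, test):
    """p-value for the observed difference in means, assuming the null hypothesis
    that both samples come from the same population (normal approximation)."""
    diff = test.mean() - control.mean()
    # Standard error of the difference in means = spread of its sampling distribution.
    se = np.sqrt(control.var(ddof=1) / len(control) + test.var(ddof=1) / len(test))
    z = diff / se
    return 2 * stats.norm.sf(abs(z))  # chance of a difference at least this extreme

rng = np.random.default_rng(1)
control = rng.exponential(scale=1.00, size=50_000)  # synthetic control revenue
test = rng.exponential(scale=1.02, size=50_000)     # synthetic test revenue
print(f"p = {two_sided_p_value(control, test):.3f}")
```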


3. Significance levels are very much a convention and vary across disciplines and situations. Sometimes people use 0.01 or 0.001 instead of 0.05 as the significance level. As we all learned from the Higgs boson discovery, physicists need 5 sigmas (which translates to a p-value of about 0.0000003) before a result is officially accepted as a “discovery.” Traditional significance levels are biased strongly against false positives (claiming an effect to be real when it is actually false) because of the severe cost of championing a false new theory or investing in an ineffective new drug.
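
For the sigma-to-p conversion, here is the one-sided tail probability of a standard normal, which is the convention particle physicists quote:

```python
from scipy import stats

for sigmas in (2, 3, 5):
    print(f"{sigmas} sigma -> p = {stats.norm.sf(sigmas):.1e}")
# 5 sigma -> p = 2.9e-07, i.e. roughly the 0.0000003 quoted above.
```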