Confused about p-values and hypothesis testing? Let’s play a game.

Mon Apr 18 2022

Tim Chan

Lead Data Scientist, Statsig

You get to flip a coin and if it’s heads, you win $10. If it’s tails, I win $10.

We play twice, tails comes up twice, and you owe me $20. You'll probably chalk this up to bad luck; after all, there's a 25% chance a fair coin will produce this result. So you decide to play 8 more times and get 8 more tails. That's 10 tails out of 10 flips; you now owe me $100 and I'm grinning ear to ear… are you suspicious yet? You should be: the chance of this happening with a fair coin is less than 1 in a thousand (<0.1%).
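If you want to check those numbers yourself, the arithmetic is just 0.5 raised to the number of flips. Here's a minimal sketch, assuming a fair coin where every flip is an independent 50/50 event:

```python
# Chance of getting all tails in n flips of a fair coin is 0.5 ** n
for n in (2, 10):
    prob = 0.5 ** n
    print(f"{n} tails in {n} flips: {prob:.2%}")

# 2 tails in 2 flips: 25.00%
# 10 tails in 10 flips: 0.10%
```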

Somewhere between 2 and 10 coin flips is a point where you should call bullshit. I recommend setting a high bar so you don't use foul words over everyday bad luck. But you don't want the bar to be too high, because you're not a sucker either. I suggest you call me out if the outcome has less than a 1 in 20 chance of occurring (<5%). This means if you get 4 tails out of 4 (a 6% chance), you chalk it up to bad luck. If you get 5 tails out of 5 (a 3% chance), you decide you were cheated and call bullshit.
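Extending the sketch above, a couple of lines find exactly where a run of straight tails first drops below that 1-in-20 bar (the 5% cutoff here is the threshold suggested above, not a law of nature):

```python
# Find the shortest run of straight tails that a fair coin produces
# less than 5% of the time.
ALPHA = 0.05

n = 1
while 0.5 ** n >= ALPHA:
    n += 1

print(n, 0.5 ** n)  # 5 0.03125 -- 4 straight tails (0.0625) is still above the bar
```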

You now understand Frequentist hypothesis testing. You assumed the coin was fair (the null hypothesis), and only when you ended up with a result below a reasonable threshold did you call bullshit (5 tails out of 5 flips, <5%). You rejected the null hypothesis, meaning you accept the alternative hypothesis that the coin is biased.

Congrats! You’ve just learned hypothesis testing for $50.
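If you'd rather let a library do the bookkeeping, the same decision can be run through a standard binomial test. This sketch uses scipy (my choice of tool here, not something prescribed by the game):

```python
from scipy.stats import binomtest

# Null hypothesis: the coin is fair (P(tails) = 0.5).
# Alternative: the coin is biased toward tails.
result = binomtest(k=5, n=5, p=0.5, alternative="greater")
print(result.pvalue)  # 0.03125

# 0.03125 < 0.05, so we reject the null hypothesis and call bullshit.
```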

Major Misconceptions to Watch Out For

1. “There is a 95% chance the coin is bad.”

This is the most common misconception around p-values, confidence intervals, and hypothesis testing. Hypothesis testing does not tell us the probability we made the right decision; we simply don't know. Knowing that would require information like: did the coin come from your pocket or mine? Was I just inside a magic shop? Do I have a large stack of money I've won from other people? While these answers should affect your estimate of the chance the coin is unfair, it's really hard to quantify them objectively. Instead, hypothesis testing ONLY tells us that the result is odd when we assume the coin is fair.

This is directly applicable to A/B testing… we don't know the probability that a test will work, and guessing only introduces bias. Instead we assume there will be no effect, and only if we see an unlikely result do we make a big deal of it. The cool thing about hypothesis testing is that it's unbiased and doesn't require us to estimate the chance of success (which can be a highly subjective process).

2. “There is a 5% chance we’re wrong.”

We have the confusing definitions of p-values and significance to blame for this. A p-value of 0.05 means that the result (and anything as extreme) has a 5% chance of occurring under the null hypothesis. In our example, we're stating that the outcome has a <5% chance of occurring IF the coin is fair. This is also called the false positive rate, and it is something we do know and can control, but it's not the same as knowing the chance we're wrong.
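One way to see what that false positive rate means is to simulate a pile of perfectly fair coins and count how often the 5-tails-out-of-5 rule cries foul anyway. A rough sketch (the exact count will wobble a little from run to run):

```python
import random

random.seed(0)

TRIALS = 100_000
false_positives = 0

for _ in range(TRIALS):
    flips = [random.random() < 0.5 for _ in range(5)]  # True means tails
    if all(flips):  # 5 tails out of 5: we'd (wrongly) call this fair coin unfair
        false_positives += 1

print(false_positives / TRIALS)  # close to 0.03125 -- the false positive rate of our rule
```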

3. “We know how bad the coin is.”

We know that the outcome is unlikely if the coin were fair, so we conclude it must not be fair. But we don't know how the coin truly behaves: does it have two tails? Or does it just land on tails 60% of the time? We were only able to reject the null hypothesis and conclude that the coin isn't fair. It's fairly standard practice to take the observed result (5 tails out of 5 = 100%), with some margin of error, as our best guess of the coin's behavior after rejecting the null hypothesis. But the truth is that coins with many different degrees of bias could easily have produced this result.
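To make that concrete, here's a quick sketch of how likely 5 tails out of 5 would be for coins with a few illustrative (made-up) biases:

```python
# Chance of seeing 5 tails in 5 flips for coins with different tails probabilities.
for p_tails in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"P(tails) = {p_tails:.1f}: chance of 5/5 tails = {p_tails ** 5:.3f}")

# P(tails) = 0.5: chance of 5/5 tails = 0.031
# P(tails) = 0.6: chance of 5/5 tails = 0.078
# P(tails) = 0.7: chance of 5/5 tails = 0.168
# P(tails) = 0.8: chance of 5/5 tails = 0.328
# P(tails) = 0.9: chance of 5/5 tails = 0.590
# P(tails) = 1.0: chance of 5/5 tails = 1.000
```

Any of the coins from 60% tails upward could plausibly have produced what we saw; the test alone doesn't tell us which one we're holding.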

4. “This isn’t trustworthy, we need a larger sample size.”

This misconception largely originates from A/B testing leaders like Microsoft, Google, and Facebook, who talk a lot about experimentation on hundreds of millions of users. Larger samples do tend to give more sensitive tests. But statistical power is about more than just sample size; it also depends on effect size. Small companies almost always see big effect sizes, giving them MORE statistical power than large companies (see You Don't Need Large Sample Sizes to Run A/B Tests). Many scientific studies are based on small sample sizes (<20). The coin-flip example required only 5 flips. The whole point of statistics is to identify which results are plausibly signal versus noise; a small sample size is already accounted for in the math.
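To put rough numbers on the sample-size-versus-effect-size tradeoff, here's a sketch using statsmodels' power calculator. The 50% baseline conversion rate and the lift sizes are made-up for illustration:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Approximate users needed per group to detect a lift from a 50% baseline
# conversion rate with 80% power at a 5% significance level.
analysis = NormalIndPower()
for treated_rate in (0.55, 0.60, 0.70):
    effect = proportion_effectsize(treated_rate, 0.50)
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8)
    print(f"50% -> {treated_rate:.0%}: ~{n:.0f} users per group")

# 50% -> 55%: ~782 users per group
# 50% -> 60%: ~194 users per group
# 50% -> 70%: ~46 users per group
```

The bigger the effect, the smaller the sample needed to detect it.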

Statistical Aside: What About Peeking?

Some readers will call me out on the peeking problem, which I ignored for simplicity. In a nutshell, the number of times you peek at or reevaluate your results should affect your statistics. One correct approach is to pick a fixed number of flips before you start and only make a decision at the end (this is called a fixed-horizon test).
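To see why peeking matters, here's a simulation sketch: flip a fair coin up to 20 times, run the same one-sided test after every single flip, and stop the moment anything looks "significant." The 20-flip horizon and the choice of test are my own illustration:

```python
import random
from scipy.stats import binomtest

random.seed(0)

SIMS = 5_000
MAX_FLIPS = 20
ALPHA = 0.05

early_calls = 0
for _ in range(SIMS):
    tails = 0
    for flip in range(1, MAX_FLIPS + 1):
        tails += random.random() < 0.5
        # Peek: test "is this coin biased toward tails?" after every flip
        if binomtest(tails, flip, 0.5, alternative="greater").pvalue < ALPHA:
            early_calls += 1
            break

print(early_calls / SIMS)  # well above 0.05 -- peeking inflates the false positive rate
```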

The smart folks on the Netflix experimentation team wrote a more thorough and statistically rigorous explainer using coin flips in their blog post Interpreting A/B test results: false positives and statistical significance. Be sure to check it out.

Thanks to ZSun Fu on Unsplash for the photo!

