Disclaimer: this is in good fun and I don't advocate road rage.
You, dear reader, are driving somewhere nice. However, the driver behind you is determined to make you have a bad day; they've come up uncomfortably close behind you, and you're pretty sure they're a bad driver who should feel bad about how they're driving.
Soon, they pass you, and the age-old question rings through your brain:
Do I need to let this person know that they're a jerk?
The decision above is of course a personal one, but it does have some interesting inputs.
Let's say the world is simple, and people are either a "good driver" or a "bad driver." If we do flip someone off, but they're a good driver who made a mistake, we're the jerk. That's bad. But we don't want a bad driver to get away with bad driving without getting the "gift" of feedback! In experimentation we'd discuss this in terms of Type 1 and Type 2 errors: flipping off a good driver is a false positive (Type 1 error), while waving at a bad driver is a false negative (Type 2 error).
Our objective is to minimize the error rate: let's try to flip off bad drivers and give a friendly wave to good drivers. However, these two goals are at odds; if you just wanted to flip off bad drivers, you'd flip off everybody, and if you just wanted to wave to good drivers, you'd wave to everyone!
There are three factors that will help us make our decision and minimize errors:
What % of the time does a bad driver drive badly?
What % of the time does a good driver drive badly?
What % of drivers are bad drivers?
Once we have these, we can calculate the odds that a bad driving incident was caused by a bad driver using Bayes' theorem:
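Written out in our driving terms, Bayes' theorem looks like this (the event names are just the ones from the story above):

P(\text{bad driver} \mid \text{bad driving}) = \frac{P(\text{bad driving} \mid \text{bad driver}) \cdot P(\text{bad driver})}{P(\text{bad driving})}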
Related: Try the Statsig Bayesian A/B Test calculator.
For anyone unfamiliar with the notation, a brief primer:
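The main piece of notation is conditional probability: P(A | B) means "the probability of A, given that B happened," and it's defined as:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}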
The core statement is that we can divide the probability of "bad driving AND bad driver" by the total probability of "bad driving" to get the proportion of bad driving where a bad driver is at the wheel.
Since we've split the world into "good" and "bad" drivers, we can calculate the denominator, or the total probability of bad driving, by combining the probabilities of a "good driver driving badly" and a "bad driver driving badly":
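Spelled out, that combination is just the law of total probability:

P(\text{bad driving}) = P(\text{bad driving} \mid \text{bad driver}) \cdot P(\text{bad driver}) + P(\text{bad driving} \mid \text{good driver}) \cdot P(\text{good driver})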
It's sometimes easier to just draw this out:
This is loosely a "probability space" diagram.
Without having had a bad driving encounter, your best guess as to whether a random driver is good or bad would be the population average (in the chart above, 20% bad and 80% good).
But seeing them drive badly gives you information! It doesn't guarantee that they're a bad driver, but it should change your expectations: they now have a 56% chance of being a bad driver and a 44% chance of being a good driver. This is referred to as "updating a prior," and it's a fundamental concept behind Bayesian statistics.
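To see where the 56% comes from, plug in the numbers from the chart (20% bad drivers, a 50% chance a bad driver drives badly, and a 10% chance a good driver does; these are the same rates used in the next section):

P(\text{bad driver} \mid \text{bad driving}) = \frac{0.5 \times 0.2}{0.5 \times 0.2 + 0.1 \times 0.8} = \frac{0.10}{0.18} \approx 0.56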
Let's look at some examples of what this means in practice. I'll talk about how we can easily come up with some of these figures later, but for now...
Let's use the same numbers as before and say that bad drivers have a 50% chance to drive badly and good drivers have a 10% chance to drive badly. Then we can draw a curve:
This curve represents how the "% of bad drivers" in the population influences the chance that the person who cut you off was a bad driver, and therefore whether you should be mad!
Your other assumptions (how often bad drivers drive badly, and how often good drivers drive badly) will influence the shape of this curve. For example, if we change our assumption about how often good drivers drive badly, we get a very different shape:
Here are some example rules of thumb for your own driving needs:
This applies directly to (frequentist-based) experimentation and how people should interpret experimental results in science or business.
Let's say you are running A/B tests (or Randomized Controlled Trials) with a significance level of 0.05 and a power of 0.80. Let's also pretend that you have a known "success rate" for your experiments.
When analyzing your results, the odds that there was a real lift, given that you had significant results, can be expressed as:
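With a success rate of s, significance level α = 0.05, and power = 0.80, this has the same Bayes' rule shape as the driving example:

P(\text{real lift} \mid \text{significant}) = \frac{\text{power} \cdot s}{\text{power} \cdot s + \alpha \cdot (1 - s)}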
And the odds that there wasn't a real lift, given that you had insignificant results, are:
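By the same logic (same s, α, and power as above):

P(\text{no lift} \mid \text{not significant}) = \frac{(1 - \alpha) \cdot (1 - s)}{(1 - \alpha) \cdot (1 - s) + (1 - \text{power}) \cdot s}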
If we plot those against different "success rates," we get some useful curves:
What does this mean? These are basically a plot of how often your results, significant or not, will match the "intuitive" interpretation of them.
Experiments with middling success rates will have the most trustworthy results across both types.
At the tail ends (super risky or super safe experiments), one of your interpretations will get flaky. For example, if you have a low chance of success, most of your "stat-sig wins" will be false positives due to noise. If you have a high chance of success, many of your "neutral" results will be false negatives.
In practice, you may not know your success rate. Some companies like Microsoft or Google track success rates, and smaller companies tend to have more opportunities for wins (low hanging fruit).
Even without knowing your exact rate, though, you might have some intuition like "I think it's 50/50 that this will drive my targeted lift" or "I think this is a long shot and it's only 10%." This can really help guide your interpretation of significance in A/B test results. For advanced experimenters, you might consider changing your power and alpha to try to avoid flaky interpretations.
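If you want to play with that, here's a minimal sketch (not part of the notebook below; it assumes the α = 0.05 and power = 0.80 from earlier) that computes both probabilities for a given success rate:

def p_real_lift_given_significant(success_rate, alpha=0.05, power=0.80):
    # Share of statistically significant results that reflect a real lift
    true_positives = power * success_rate
    false_positives = alpha * (1 - success_rate)
    return true_positives / (true_positives + false_positives)

def p_no_lift_given_not_significant(success_rate, alpha=0.05, power=0.80):
    # Share of "neutral" (non-significant) results where there really was no lift
    true_negatives = (1 - alpha) * (1 - success_rate)
    false_negatives = (1 - power) * success_rate
    return true_negatives / (true_negatives + false_negatives)

# A long-shot experiment: roughly half of the "wins" are noise
p_real_lift_given_significant(0.05)    # ~0.46
# A near-certain experiment: about a third of "neutral" results are false negatives
p_no_lift_given_not_significant(0.90)  # ~0.35

Pushing alpha down or power up shifts both curves, which is one way to shore up whichever interpretation you care about most.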
We can whip this up without much code in Python. You can copy the notebook here. Try not to judge my code too much!
We can define the conditional probability of a bad driver given the inputs above like so:
def p_bad_driver(bad_driver_rate, bad_driver_bad_rate, good_driver_bad_rate):
    # P(bad driver | bad driving) via Bayes' theorem
    good_driver_rate = 1 - bad_driver_rate
    numerator = bad_driver_bad_rate * bad_driver_rate
    denominator = (bad_driver_bad_rate * bad_driver_rate
                   + good_driver_bad_rate * good_driver_rate)
    return numerator / denominator
Let's check our ~56% number from above:
p_bad_driver(0.2, 0.5, 0.1)
# 0.5555555555555555
Looks good! You can copy the notebook to play with any of the parameters and generate new versions of the charts.
I hope I've convinced you, with this silly example, that conditional probability can be a useful way to think about how you interpret the world around you. It's something we all do intuitively, and putting numbers on it helps to confirm that intuition.
This is really important to me as someone who's worked deeply with experimentation teams as a data scientist. It's often difficult to communicate our confidence in our results effectively. There are a lot of probabilities, parameters, and conflicting results involved in testing, and I've seen it frustrate some really smart people.
At Statsig, we're trying to strike the right balance between making experimentation easy and trustworthy and giving people the tools to deeply understand what their results mean. It's an ongoing but worthwhile process!