Stratified sampling allows you to avoid pre-existing differences between groups in your experiments along metrics or the distribution of users across arbitrary attributes.
For B2B tests, or tests where a tail-end of power users drive a large portion of an overall metric value, this feature meaningfully reduces both false positive rates and false discovery rates, making your results more consistent and trustworthy. In our simulations, we saw around a 50% decrease in the variance of different A-A test results.
Running an online experiment means fighting against a slew of factors that can make your experiment analysis fail to produce useful results. Modern experimentation platforms have a lot of tools to help address these. For example:
Winsorization or capping helps to reduce the influence of outliers
CUPED can give you more power in less time
Sequential testing lets you peek without inflating your false positive rate
SRM checks detect imbalanced enrollment rates
Pre-experimental bias detection helps you know if your experiment - by bad luck - randomly selected two groups that were different before the experiment ever started
When we launched pre-experiment bias checks, we realized that this feature was useful for visibility, but fundamentally didn’t solve the problem. Customers were asking for guidance on how to use and prevent issues.
Unfortunately, in cases of extreme bias, there’s not much to do except restart the experiment with a new “salt”—meaning throwing out the data you’ve already collected.
At the same time, Statsig has many customers running small-unit-count experiments, especially in the B2B space.
In supporting these customers, a common issue we encountered is that a handful of end users would drive the majority of the distribution. Even with winsorization, having an imbalance in this segment would drive an imbalance in experiment results orthogonal to the true effect of the experiment.
Stratified sampling dramatically reduces the incidence of both of these issues.
The Statsig SDKs use a salt to randomize or bucket experiment subjects. When you enable stratified sampling, we'll try n different salts (100 for now) and evaluate how "balanced" your groups are.
We evaluate this balance based on either a metric you pick or an attribute you give us describing your experiment subjects. We pick the best salt from this set and save this as the salt to use.
Salt selection will help you to run more experiments in less time. Your results will be meaningfully more consistent and trustable, boosting your experiment’s confidence level and reducing your false positive rate at the same power.
When running multiple experiments, restarting the same experiment may result in different outcomes. Stratified sampling reduces this variability significantly.
In practice, you will configure stratified sampling before running an experiment in Statsig by configuring a metric or attribute to balance across.
For metrics, Statsig will striate users into buckets and do a weighted chi-squared across the buckets to find a good salt. For attributes, Statsig will perform the same analysis but with the buckets you’ve defined. This only takes a few minutes, and can become a part of your standard experiment workflow.
Read now: Statsig's stratified sampling documentation
The Statsig <> Azure AI Integration is a powerful solution for configuring, measuring, and optimizing AI applications. Read More ⇾
Take an inside look at how we built Statsig, and why we handle assignment the way we do. Read More ⇾
Learn the takeaways from Ron Kohavi's presentation at Significance Summit wherein he discussed the challenges of experimentation and how to overcome them. Read More ⇾
Learn how the iconic t-test adapts to real-world A/B testing challenges and discover when alternatives might deliver better results for your experiments. Read More ⇾
See how we’re making support faster, smarter, and more personal for every user by automating what we can, and leveraging real, human help from our engineers. Read More ⇾
Marketing platforms offer basic A/B testing, but their analysis tools fall short. Here's how Statsig helps you bridge the gap and unlock deeper insights. Read More ⇾