Stratified sampling in A/B tests

Tue Jan 28 2025

Craig Sexauer

Data Scientist, Statsig

Imagine you're running experiments to fine-tune your product, but your results swing wildly in every experiment you run.

Stratified sampling might just be the tool you need to bring clarity and precision to your A/B testing efforts. It helps ensure that the comparisons you make are truly fair and as "apples to apples" as possible.

Stratified sampling isn't just another buzzword; it's a robust statistical method that enhances the accuracy of your A/B tests, ensuring that every subgroup within your dataset is properly represented. This approach not only improves the quality of your data but also deepens your understanding of how different segments interact with your product.

Related reading: A/B testing 101

Introduction to stratified sampling in A/B testing

Stratified sampling is a technique used to partition a population into smaller, distinct subgroups or strata before sampling. This method is crucial in A/B testing as it ensures that each subgroup is adequately represented, thereby providing a more accurate, unbiased sample that reflects the diversity of the entire population. For the practitioner, this means that random false-positives driven by small, high-usage user groups are less likely.

The reason stratified sampling is so valuable in A/B testing boils down to its impact on precision and reliability. Stratified sampling reduces the false positive rate by enforcing the "identical" element of the i.i.d. assumption in experimentation.

By integrating stratified sampling into your A/B testing framework, alongside drilldowns like those offered on most experimentation platforms, you're not just experimenting; you also gain a precise understanding of how different segments of your user base respond to changes, allowing for more targeted and effective optimizations.

Designing stratified samples for A/B tests

When setting up your A/B tests, picking the right strata is step one. Think about what factors might affect the outcome—age, location, usage frequency? These are your strata.

Here’s how to nail down these crucial elements:

  • Identify key covariates: Look at past data to see which demographics or behaviors link closely with the changes you’re testing.

  • Categorize your users: Group them by these identified covariates, so that each category is represented in both test and control.

There will be tradeoffs in balancing. Generally, groups with a small number of experimental units but a large share of metric contribution are the most important to balance.

If you have two groups that each contribute 50% of your topline value, and one has 100,000 users while the other has 10, it's far more likely that the group of 10 will end up split unevenly across your experiment groups. If 8 of them land in test and 2 in control, you'd report a ~86% lift even with no treatment effect! Stratified sampling prevents this from occurring.
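The arithmetic behind that scenario is easy to check. A minimal sketch (the segment sizes and 50/50 value split are from the example above; everything else is just bookkeeping):

```python
# Two segments each contribute 50% of topline value, but one has
# 100,000 users and the other only 10 "head" users.
head_value_per_user = 0.5 / 10        # each head user: 5% of topline
tail_value_per_user = 0.5 / 100_000   # each tail user: a tiny sliver

# Tail users split evenly (50,000 per arm), but the 10 head users
# happen to land 8 in test and 2 in control, purely by chance.
test = 50_000 * tail_value_per_user + 8 * head_value_per_user      # 0.65
control = 50_000 * tail_value_per_user + 2 * head_value_per_user   # 0.35

lift = (test - control) / control
print(f"{lift:.0%}")  # ~86% "lift" with zero treatment effect
```

The tail users contribute identically to both arms; the entire phantom lift comes from six head users landing on the wrong side of the split.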

By following these steps, you're setting your A/B test on a foundation built for insightful, actionable results.


Implementing stratified sampling in A/B tests

There are three common methods of stratification:

1.) Within your assignment solution. This is often implemented by keeping per-stratum counters of assignments so far and adjusting allocation rates to keep them in check as the experiment progresses. This works for small experiments, or offline experiments, but can be challenging in a scaled real-time platform due to the expense and latency of looking up these counters and a user's existing assignments. Most platforms instead use a hashing algorithm to deterministically assign a user to the same group without a database lookup on subsequent visits.
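To make the contrast concrete, here is a hedged sketch of both assignment styles (function and variable names are illustrative, not any platform's API). The counter approach needs shared mutable state; the hash approach needs none:

```python
import hashlib

# (a) Counter-based stratified assignment: always send the next user in a
# stratum to whichever arm is lagging. Requires centralized state, which is
# what makes it expensive in a real-time, distributed setting.
counters: dict[str, dict[str, int]] = {}  # stratum -> {"test": n, "control": n}

def assign_with_counters(user_id: str, stratum: str) -> str:
    counts = counters.setdefault(stratum, {"test": 0, "control": 0})
    group = "test" if counts["test"] <= counts["control"] else "control"
    counts[group] += 1
    return group

# (b) Deterministic hashing: the same (salt, user) pair always maps to the
# same arm, so no lookup of prior assignments is ever needed.
def assign_with_hash(user_id: str, salt: str = "exp_42") -> str:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "test" if int(digest, 16) % 2 == 0 else "control"
```

The counter version guarantees balance within each stratum but serializes assignment; the hash version is stateless and fast but only balances strata in expectation, which is what motivates methods 2 and 3 below.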

2.) Post-hoc sampling or tools like CUPED. It's possible to filter out "extra users" in one segment post-hoc; in the example above, we could randomly filter out 6 head users from the analysis to balance a 2-2 comparison. The cost is losing some critical data points.
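A minimal sketch of that post-hoc rebalancing, using the head-user example above (the function name and seeding are my own, assumed for illustration):

```python
import random

def rebalance(test_ids: list, control_ids: list, seed: int = 0):
    """Randomly downsample the larger arm of a stratum so both arms
    contain equal numbers of users. The dropped users' data is lost,
    which is the cost mentioned above."""
    rng = random.Random(seed)  # seeded for reproducible analysis
    n = min(len(test_ids), len(control_ids))
    return rng.sample(test_ids, n), rng.sample(control_ids, n)

# 8 head users in test vs. 2 in control -> compare 2 vs. 2 instead.
test_kept, control_kept = rebalance(
    [f"head_t{i}" for i in range(8)], ["head_c1", "head_c2"]
)
```

You would apply this per stratum, then pool the retained users for analysis.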

CUPED, if implemented perfectly, can also functionally stratify your data by a covariate. This does require you to correctly set up the regression such that you have perfect coverage of the stratification covariate, and that your algorithm handles the categorical regression without issue. For example, in one-hot encoding it's common to drop low-frequency groups -- which might be just the ones you care about!

3.) Pre-experiment sampling. This is a technique used by companies like Statsig to identify "salts" for use in a hashing algorithm which deliver balanced results. By simulating different salts and using a modified chi-squared technique, you can identify a balanced randomization that yields stratified populations.
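A simplified sketch of that salt search, under my own assumptions (SHA-256 for the hash, a plain chi-squared-style statistic rather than the modified version mentioned above): hash every known user under each candidate salt, score how evenly each stratum splits, and keep the best salt.

```python
import hashlib
from collections import Counter

def split_by_salt(users, salt):
    """users: iterable of (user_id, stratum). Returns per-arm stratum counts."""
    arms = {"test": Counter(), "control": Counter()}
    for user_id, stratum in users:
        digest = int(hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest(), 16)
        arms["test" if digest % 2 == 0 else "control"][stratum] += 1
    return arms

def imbalance_score(users, salt):
    """Chi-squared-style statistic vs. a perfect 50/50 split per stratum;
    lower means more balanced."""
    arms = split_by_salt(users, salt)
    score = 0.0
    for stratum in {s for _, s in users}:
        t, c = arms["test"][stratum], arms["control"][stratum]
        expected = (t + c) / 2
        if expected:
            score += (t - expected) ** 2 / expected + (c - expected) ** 2 / expected
    return score

def best_salt(users, candidate_salts):
    """Simulate each candidate salt and keep the most balanced one."""
    return min(candidate_salts, key=lambda s: imbalance_score(users, s))
```

Because assignment is still a pure hash, the chosen salt keeps all the operational benefits of deterministic hashing while delivering a stratified split.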

It's recommended to use CUPED in conjunction with one of the other solutions to guarantee a fair split. By correctly using one of these methods, you can ensure that your A/B testing is both efficient and effective, providing reliable insights into user behavior and preferences.
