There are a few different routes you could take to achieve this, and each has its own pros and cons:
One reasonable idea is to keep a giant table of all of your users and the group each one is assigned to for the flag. There are some clear upsides here.
For analysis, you already have a nice table of tests and controls to use, and changing a given user’s assignment is as simple as updating a row in that table. However, the latency implications of this method are severe: every time you check a gate or an experiment, you’re performing a lookup in a database.
There’s another issue with that approach, as well. When you’re doing your analysis, you’ll be suffering from massive overexposure.
That is to say, if you have a change that only 10% of your users actually encountered, you’re still computing differences between control and test as if every single one of your users had run into it. This adds quite a bit of noise to your experiment, and getting accurate results will take quite a bit longer.
Our approach to solving that issue is to only have a table of the users who actually saw the experiment. When you check an experiment, we log an “exposure” event to populate that table. Doing it at the time of check nicely ensures that there’s minimal room for differences in experiment behavior to cause differences in logging, helping prevent issues like Sample Ratio Mismatch from arising.
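A minimal sketch of logging the exposure at check time might look like the following. Everything here is hypothetical and for illustration only: `ExposureLog`, `check_experiment`, and `toy_assign` are made-up names, the in-memory list stands in for a real exposure table, and the placeholder assignment rule is not the one described in this post.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class ExposureLog:
    """Hypothetical in-memory stand-in for the exposure table."""
    events: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, user_id: str, experiment: str, group: str) -> None:
        # One row per check: (user, experiment, group served).
        self.events.append((user_id, experiment, group))

    def exposed_users(self, experiment: str) -> Set[str]:
        # Only users who were actually checked appear here.
        return {u for (u, e, _) in self.events if e == experiment}


def check_experiment(
    user_id: str,
    experiment: str,
    assign: Callable[[str, str], str],
    log: ExposureLog,
) -> str:
    """Compute the group and log the exposure in the same call."""
    group = assign(user_id, experiment)
    log.record(user_id, experiment, group)
    return group


def toy_assign(user_id: str, experiment: str) -> str:
    # Placeholder rule for the sketch only (odd-length IDs get "test").
    return "test" if len(user_id) % 2 else "control"
```

Because the group is computed and logged in one function, the analysis population is exactly the set of users for whom the check ran, and a user who never reaches the check never enters the table.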
On the assignment side, we make assignment completely deterministic.
The simplest version of this you could imagine is saying, “Users who have an even ID get control, and users who have an odd ID get test.” This obviates the need for any sort of database lookup or keeping a gigantic list in memory.
Of course, the obvious pitfall with that heuristic is that every experiment would have the same split, removing the randomization that’s so critical to cogent analyses. The answer, then, is to give each experiment its own deterministic split.
We accomplish this by generating a salt for each experiment, combining that with the given ID we’re checking the experiment for, and computing a SHA-256 hash of the result. Rather than a simple even-odd check, we take the hash modulo 10,000, which gives 10,000 buckets and allows finer-grained control of percentages.
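The salted-hash bucketing described above can be sketched roughly as follows. The post does not say how the salt and ID are combined or which part of the digest is used, so the `.` separator, the use of the first 8 bytes, and the name `bucket_for` are all assumptions made for this illustration.

```python
import hashlib

NUM_BUCKETS = 10_000  # modulo 10,000, as described in the text


def bucket_for(salt: str, unit_id: str) -> int:
    """Map (experiment salt, ID) to a stable bucket in [0, 10000).

    Assumption: salt and ID are joined with "." and the first 8 bytes
    of the SHA-256 digest are read as a big-endian unsigned integer.
    """
    digest = hashlib.sha256(f"{salt}.{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def in_first_percent(salt: str, unit_id: str, percent: float) -> bool:
    """True if the ID falls in the first `percent` of buckets."""
    return bucket_for(salt, unit_id) < round(percent * NUM_BUCKETS / 100)
```

The same salt and ID always produce the same bucket, so no lookup or stored list is needed, while a different salt reshuffles which IDs land in which buckets.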
Choosing which buckets get which value is itself an interesting problem. For gates, where we expect the percentage of a feature to be rolled up and down, we want the mapping to be as deterministic as possible: with a 30% rollout, the first 30% of buckets simply correspond to passing the gate.
For experiments in a layer, on the other hand, where you might be running multiple iterations of different experiences, it’s actually preferable that each time you go to 30% of users, it is a different 30% of users, so we assign the buckets randomly there.
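The two bucket-selection strategies can be contrasted in a short sketch. The function names and the use of a seeded `random.Random` are assumptions for illustration; the post only states that gates take the first N% of buckets and that layer experiments get randomly chosen buckets.

```python
import random
from typing import Set

NUM_BUCKETS = 10_000


def gate_pass_buckets(rollout_percent: float) -> range:
    """Gates: the first N% of buckets pass, so ramping up only adds users."""
    cutoff = round(rollout_percent * NUM_BUCKETS / 100)
    return range(cutoff)


def layer_experiment_buckets(
    free_buckets: Set[int], percent: float, rng: random.Random
) -> Set[int]:
    """Layer experiments: draw the requested share of buckets at random
    from those still free in the layer (hypothetical helper)."""
    count = round(percent * NUM_BUCKETS / 100)
    # sorted() gives random.sample a sequence, since sets are not accepted.
    return set(rng.sample(sorted(free_buckets), count))
```

With the gate rule, every bucket that passes at 30% still passes at 50%, so users do not flip in and out as a rollout ramps. With the layer rule, two separate 30% allocations will generally cover different users.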