As a vendor of experimentation platforms, we have observed that in the tech industry, the success of an experimentation program depends less on the sophistication of the tests and more on the trustworthiness and scalability of the program.
Sophisticated tests can undermine the overall trust in an experimentation program by introducing more degrees of freedom, thereby increasing the risk of p-hacking. This issue is exacerbated by incentive structures that reward individuals for conducting tests with significant results.
In most tech companies, the experimentation system is easy to start but difficult to scale due to information and managerial complexities. Because these factors are less tangible, they are often overlooked, leading to wasted resources invested in an unscalable system. In such cases, the cost of maintaining more experiments increases super-linearly, while the benefits increase sub-linearly.
We serve thousands of companies with over 2 billion end users per month. In this paper, we distill our learnings and lessons into four key technical insights to ensure the scalability of experimentation systems by addressing information overload and managerial complexity through thoughtful system design.
For any system to be scalable, the cost of operation must increase sub-linearly with scale. With modern advancements in databases, compute, and storage, tangible costs are not a primary concern for most experimentation systems. However, two intangible factors limit the scalability of these systems: information overload and managerial complexity.
Experimentation generates a vast amount of information, which typically increases polynomially:
Parameters: The parameter space grows with the number of experiments, metrics, variants, and user segments, and all of these dimensions expand rapidly as experiments become more complex (see the rough calculation after this list).
Historical Relevance: Experiments serve both decision-making and learning purposes, requiring a comprehensive understanding of both current and past experiments.
System Chaos: The rigorous process of experimentation involves many potential pitfalls, including sample ratio mismatches, multiple comparisons, peeking, network effects, underpowered studies, logging errors, and pipeline mistakes.
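To make the polynomial growth concrete, here is a rough, purely illustrative calculation; the counts below are hypothetical and chosen only to show how quickly the result space multiplies:

```python
# Hypothetical counts for a mid-sized experimentation program (illustrative only).
experiments = 200   # concurrent and recent experiments
metrics = 50        # metrics scored per experiment
variants = 3        # average variants per experiment
segments = 10       # user segments sliced per metric

# Each experiment yields one result cell per (metric, variant, segment),
# so the information to read, monitor, and debug grows with the product:
result_cells = experiments * metrics * variants * segments
print(f"{result_cells:,} result cells to monitor")  # 300,000
```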
If companies do not have a system to process and synthesize this information, they often rely on personnel to manage the complexity, which is an inherently unscalable solution.
Managerial overhead is even less tangible than information, yet it cannot be ignored:
Most engineers and product managers lack the statistical knowledge necessary to interpret experimental results and make correct inferences from observed effects to true effects (Cunningham, 2023).
Managerial incentives often encourage detrimental behaviors, such as p-hacking.
Experiments can leave behind technical debt in the form of stale configurations in the codebase.
While these challenges are solvable, mid-level managers typically lack the incentives to address them due to the principal-agent problem and resource constraints. These factors should be carefully considered when designing the system.
Without a well-designed system, the return on investment (ROI) for experimentation will decrease with scale (sketched briefly after the list below) because:
The return on experiments grows at best linearly, and often sub-linearly, with scale, as less effort is available to turn information into impact.
The cost of experiments grows super-linearly with scale due to information and managerial overhead.
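A minimal way to formalize this, under the growth assumptions above (return at most linear, cost super-linear in the number of experiments n):

```latex
R(n) = O(n), \qquad C(n) = \Omega\!\left(n^{1+\epsilon}\right) \text{ for some } \epsilon > 0
\;\;\Longrightarrow\;\;
\mathrm{ROI}(n) = \frac{R(n)}{C(n)} \longrightarrow 0 \quad \text{as } n \to \infty .
```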
These two factors create a dilemma for many experimentation teams—they become victims of their own success. Fortunately, this dilemma is solvable. If the system is designed correctly from the start, the cost of running more experiments can increase sub-linearly, thereby freeing up more resources to drive impact from the results of experiments.
Through our practice, we have identified four essential insights for making experimentation scalable.
AB testing requires treatment experiences, randomization, and an Overall Evaluation Criterion (OEC). By integrating randomization and a metrics system with feature flags, we can automate these elements, enabling AB testing to be triggered automatically with each feature launch, without much additional engineering effort. There are three side benefits: 1) engineers can self-serve experiments; 2) low-code experiments become possible; 3) exposure data and logging data are both native to the experimentation process, making it much easier to observe the entire system, conduct additional analyses, and debug.
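As a hedged illustration of what this integration can look like (the function names and SDK surface below are hypothetical, not a specific product API), a feature-flag check can double as the randomization step and the exposure log:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically hash the user into a variant (the randomization step)."""
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return variants[bucket % len(variants)]

def log_exposure(user_id: str, experiment: str, variant: str) -> None:
    # In practice this writes to the same event pipeline the metrics system reads,
    # which is what makes exposure data native to the experimentation process.
    print(f"exposure,{experiment},{variant},{user_id}")

def check_gate(user_id: str, experiment: str,
               variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """A feature-flag check that is simultaneously an experiment exposure event."""
    variant = assign_variant(user_id, experiment, list(variants))
    log_exposure(user_id, experiment, variant)
    return variant

# Engineers self-serve: gating the new feature is the entire experiment setup.
if check_gate("user_42", "new_checkout_flow") == "treatment":
    pass  # render the new checkout experience
```

The design choice that matters is that assignment and exposure logging come from the same call, so every gated feature launch is observable by default.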
Metrics are proxies for business outcomes and will evolve as business priorities shift. However, the underlying logging data and pipelines should remain stable. Separating the definition of metrics from logging ensures that experiments can use pre-existing setups, and metrics can be adjusted easily without affecting the integrity of the logs. We will share our experimentation architecture and the directed acyclic graph (DAG) for our pipeline in our presentation.
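A hedged sketch of what this separation can look like (the schema and names are illustrative assumptions, not our exact implementation): metric definitions are thin, declarative objects that reference stable logged events, so redefining a metric never touches the logging pipeline:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """A metric is a named aggregation over a stable, append-only event log."""
    name: str
    source_event: str            # immutable logging stream the DAG materializes
    aggregation: str             # e.g. "count", "sum", "mean"
    filters: dict | None = None  # optional refinements of the definition

# Business priorities shift: redefine the metric, not the logs.
checkout_rate_v1 = MetricDefinition(
    name="checkout_rate",
    source_event="checkout_completed",
    aggregation="count",
)
checkout_rate_v2 = MetricDefinition(
    name="checkout_rate",
    source_event="checkout_completed",
    aggregation="count",
    filters={"payment_status": "success"},  # tightened definition, same raw logs
)
```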
The chain from data to decisions is long: logging, the DAG, metric definitions, and how people interpret and use the results. Each step can introduce discrepancies and misconceptions, worsening the information overload. The single source of truth, diagnoses, and context should live in one place and be visible end-to-end.
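One hedged way to picture "one place, visible end-to-end" (the record shape below is an assumption for illustration, not a prescribed schema): every experiment carries its lineage, from logging sources through metric definitions to diagnostics and human context, on a single record:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """Single source of truth: lineage, diagnostics, and context in one place."""
    experiment: str
    hypothesis: str
    logging_sources: list[str]                            # raw event streams feeding the DAG
    metric_definitions: list[str]                         # named metric versions used for scoring
    diagnostics: list[str] = field(default_factory=list)  # automated checks and their outcomes
    discussion: list[str] = field(default_factory=list)   # interpretations and decisions

record = ExperimentRecord(
    experiment="new_checkout_flow",
    hypothesis="A one-page checkout increases completed purchases.",
    logging_sources=["checkout_completed"],
    metric_definitions=["checkout_rate_v2"],
)
record.diagnostics.append("sample ratio check: passed")
record.discussion.append("PM: effect concentrated on mobile; consider a follow-up test.")
```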
There is often a gap between what is statistically correct and what is useful for business decisions. For example, a differential baseline between groups prior to treatment does not bias the estimate statistically, but it is undesirable for making business decisions and usually requires resetting the test. An automated system should not only detect errors like sample ratio mismatches but also detect, flag, and mitigate "noise" such as heterogeneous effects, interaction effects, and skewed sampling. Additionally, the system should encourage best practices through the user interface (UI), such as using sequential testing to discourage peeking, enforcing hypothesis formulation before testing, offering multiple comparison correction upfront, and discouraging changes to the p-value threshold during an experiment.
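As a concrete, hedged example of one such automated check, a standard chi-square test for sample ratio mismatch might look like the following (a sketch of the technique, not a description of any particular engine):

```python
from scipy.stats import chisquare

def srm_check(observed_counts: list[int], expected_ratios: list[float],
              alpha: float = 0.001) -> dict:
    """Flag a sample ratio mismatch: observed assignment counts vs. intended split."""
    total = sum(observed_counts)
    expected_counts = [total * r for r in expected_ratios]
    stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
    return {"p_value": p_value, "srm_detected": p_value < alpha}

# Intended 50/50 split, but the observed counts drifted: the system should flag this
# automatically rather than rely on a data scientist to notice.
print(srm_check([50_000, 48_700], [0.5, 0.5]))  # srm_detected: True
```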
We identified seven key characteristics of a scalable experimentation system:
Default-on experiments for all new features.
Define metrics once, use everywhere.
Reliable, traceable, and transparent data.
Trustworthy, practical statistics engine—no "magical" math.
Automated checks that catch errors (e.g., SRM) and flag warnings (e.g., differential baselines).
Intentionally layered experimentation information for product decisions.
Collaborative context around experiment results.
Beyond reducing the cost of running more experiments, systems with these characteristics enable two main outcomes:
Different roles contribute their strengths: engineers manage the system with best practices; product managers generate hypotheses, foster collaboration, and provide qualitative evidence; and data scientists focus on experiment design, review, and deeper analysis. This approach allows data scientists to concentrate on their expertise rather than overseeing every aspect of the AB testing lifecycle, enhancing the overall value of experiments.
Experimentation provides credible causal evidence, but it cannot generate returns without good ideas and good execution. By treating experimentation as a collaborative effort, the goal is to elevate the entire product development team to measure, learn, and improve, ultimately creating higher returns over time.
In addition to this abstract, we have a polished presentation that has been delivered to hundreds of data scientists at various companies, helping them succeed in their experimentation efforts. We also have podcast interviews with industry practitioners that provide anecdotal evidence for the points discussed here.