Traditional frequentist approaches, particularly null-hypothesis significance testing (NHST), dominate A/B testing but come with well-known challenges such as “peeking” at interim data, misinterpretation of p-values, and difficulties handling multiple comparisons. Bayesian alternatives promise more intuitive “probability of being better” metrics and flexible sequential monitoring. In particular, “informed Bayesian” methods incorporate prior knowledge about effect sizes to theoretically speed up decision-making and shrink estimates toward realistic values.
This paper distinguishes two broad ways in which “informedness” can be brought into Bayesian A/B testing:
Adjusting the Point Estimate: One approach is to specify a prior that shifts the posterior effect size toward a particular assumption, such as optimism or skepticism. This may be done to reflect historical data, domain expertise, or other external sources of knowledge.
Tightening the Confidence (Credible) Interval: Alternatively, one can choose a narrower prior that reduces uncertainty in the posterior distribution. This approach can be aligned with methods that more strictly control error rates (e.g., controlling the False Discovery Rate), thus incorporating additional knowledge to make the resulting intervals more precise.
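To make the distinction concrete, here is a minimal sketch (not a production implementation; a conjugate normal model with made-up numbers) of how the two levers act on the same observed lift: a Type 1 prior moves the posterior point estimate, while a Type 2 prior mainly tightens the credible interval.

```python
import numpy as np
from scipy import stats

def normal_posterior(obs_lift, obs_se, prior_mean, prior_sd):
    """Conjugate normal update: posterior mean, sd, and 95% credible interval."""
    prior_prec, data_prec = 1 / prior_sd**2, 1 / obs_se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_mean * prior_prec + obs_lift * data_prec)
    post_sd = np.sqrt(post_var)
    return post_mean, post_sd, stats.norm.interval(0.95, loc=post_mean, scale=post_sd)

obs_lift, obs_se = 0.010, 0.008  # a noisy observed +1.0% lift

# Reference: an effectively flat prior leaves the estimate close to the data.
print(normal_posterior(obs_lift, obs_se, prior_mean=0.0, prior_sd=1.0))
# Type 1: an optimistic prior centered at +3% pulls the point estimate upward.
print(normal_posterior(obs_lift, obs_se, prior_mean=0.03, prior_sd=0.02))
# Type 2: a skeptical prior centered at 0 with a small sd shrinks and tightens the interval.
print(normal_posterior(obs_lift, obs_se, prior_mean=0.0, prior_sd=0.005))
```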
However, despite their theoretical appeal, implementations of informed Bayesian approaches vary widely in practice and present several practical challenges. For instance:
The choice of priors can strongly influence the resulting posterior estimates, requiring careful calibration to avoid unintentionally skewing the analysis.
Narrower priors, while reducing uncertainty, may necessitate larger sample sizes or longer testing periods. In organizations that rely on rapid experimentation, these demands on time and resources can be substantial.
The need for expertise in Bayesian statistics, along with transparent communication of how priors are chosen and interpreted, further emphasizes that informed Bayesian methods must be applied thoughtfully to balance the benefits of incorporating prior information with the realities of data-driven decision-making.
Therefore, in applications, we should be careful not to adopt Bayesian methods without understanding how they can be misused:
Neither type of informed Bayesian approach is “wrong” in principle, but the first introduces a greater risk of data manipulation, while the second can slow down decision-making.
In many cases, the second approach is effectively equivalent to applying FDR-type frequentist adjustments and often yields the same outcomes, just framed in Bayesian terms.
From an organizational perspective, the real danger is that the label “Bayesian” can be used to disguise manipulations of the data analysis process or to impose unnecessary overhead that discourages teams from testing.
Our goal is to clarify these trade-offs and suggest ways to harness Bayesian methods responsibly. For such purposes, we recommend two approaches:
We will soon launch informed Bayesian analysis as an advanced tool in our product, with access restricted to the center-of-excellence team responsible for the health of the experimentation program. Use it cautiously.
Tom Cunningham’s approach of reporting the raw estimates, benchmark statistics, and idiosyncratic details.
As Tom argued in his article: “Benchmarking solves all the problems above (inferring from the observed effect to the true effect). An empirical-Bayes shrunk estimate represents our best guess at the true treatment effect conditional on the experiment being drawn from a given reference class.”
Frequentist NHST uses p-values to check if observed data significantly depart from a null hypothesis. A p-value below α=0.05 is widely used as evidence that a treatment is better than control. Such tests are susceptible to misuse, however—early stopping (“peeking”) inflates Type I error, and p-values can be misread as the probability of a hypothesis being true.
Bayesian methods, by contrast, update a prior distribution into a posterior distribution based on observed data. One can then calculate the probability that a treatment effect is positive or negative, or that it exceeds a specified threshold. The approach handles interim looks more gracefully under certain conditions. Crucially, though, once an “informed” (i.e., nonuniform) prior is introduced, Bayesian results can shift toward or away from significance depending on how that prior is specified.
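For intuition, here is a minimal sketch of the “probability of being better” calculation, assuming binary conversions, a Beta(1, 1) prior on each arm’s conversion rate, and made-up counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical results: conversions / visitors per arm.
control_conv, control_n = 480, 10_000
treat_conv, treat_n = 520, 10_000

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each arm.
control_post = rng.beta(1 + control_conv, 1 + control_n - control_conv, size=200_000)
treat_post = rng.beta(1 + treat_conv, 1 + treat_n - treat_conv, size=200_000)

prob_better = (treat_post > control_post).mean()   # P(treatment rate > control rate)
lift_samples = treat_post / control_post - 1       # posterior of the relative lift
print(f"P(treatment > control) = {prob_better:.3f}")
print("95% credible interval for relative lift:",
      np.percentile(lift_samples, [2.5, 97.5]).round(4))
```

The same posterior samples can be recomputed as data accrue, which is the appeal for sequential monitoring; the caveat above is about what prior those samples are conditioned on.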
Integrating Realistic Expectations: When carefully based on sound historical data or domain expertise, a prior that shifts the point estimate can reflect genuine knowledge about typical effect sizes. If most changes historically yield around a 2% lift, encoding that into the prior prevents overreaction to random fluctuations.
Faster Convergence on True Effects: If the prior is well-calibrated, the posterior may identify a truly beneficial variant with fewer samples. This can enable product teams to deploy meaningful improvements earlier, potentially gaining a competitive edge.
Better Power in Borderline Cases: When the effect size is small but real, an informed prior centered near that region can help detect these subtle improvements that might otherwise require very large sample sizes in a purely uninformative approach.
Data Manipulation Disguised as Science: If a senior stakeholder pushes for a strong prior favoring the treatment, teams can end up confirming a desired outcome rather than testing it. This practice is functionally similar to “alpha inflation” but hidden behind Bayesian terminology.
Risk of Biased Adoption: Overly optimistic priors may inflate marginal posteriors, leading product teams to implement subpar changes that harm long-term outcomes. This risk is magnified when organizations lack formal processes to justify or audit priors.
Stronger Error Control: By narrowing the range of plausible effects, this approach can reduce false positives. In essence, it places a higher burden of proof on any claim of a large effect, making results more robust against random noise.
Alignment With Empirical-Bayes or FDR Frameworks: A carefully tuned prior can approximate frequentist error-control techniques like FDR adjustments. For organizations already familiar with FDR, a Bayesian analog can offer a more intuitive “probability of improvement” interpretation while maintaining rigorous thresholds.
Reduced Overconfidence in Extreme Outcomes: Tight priors naturally shrink estimates, preventing teams from overreacting to fleeting spikes or short-term anomalies. This “shrinkage” is often beneficial when multiple variants are tested and many turn out to be near-neutral.
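As a toy illustration of that shrinkage (all numbers hypothetical), under a conjugate normal model the posterior mean is a precision-weighted average of the prior mean and the observed lift, so a tight prior pulls noisy per-variant estimates toward zero:

```python
import numpy as np

rng = np.random.default_rng(1)
true_lifts = np.array([0.0, 0.0, 0.0, 0.02, -0.01])        # most variants truly near-neutral
obs_se = 0.015
obs_lifts = true_lifts + rng.normal(0, obs_se, size=true_lifts.size)

prior_mean, prior_sd = 0.0, 0.01                            # tight, skeptical prior
w = (1 / prior_sd**2) / (1 / prior_sd**2 + 1 / obs_se**2)   # weight placed on the prior
shrunk = w * prior_mean + (1 - w) * obs_lifts               # posterior means

for t, o, s in zip(true_lifts, obs_lifts, shrunk):
    print(f"true {t:+.3f}   observed {o:+.3f}   shrunk {s:+.3f}")
```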
Often Redundant With FDR: If an organization already employs frequentist techniques like Benjamini–Hochberg, the Bayesian interval-tightening approach may be duplicative. Unless there is a strong preference for Bayesian interpretation, the additional complexity might not provide new benefits.
Slows Down Decisions: Narrow intervals or high posterior thresholds typically require larger sample sizes to reach conclusive evidence, especially when the prior heavily penalizes large effects. This can delay launches and discourage frequent experimentation, potentially undermining a data-driven culture.
Empirical-Bayes Priors with Historical Data: Large-scale experimentation at organizations like Microsoft Bing shows that modeling priors from historical effect-size distributions can balance false positives and false negatives. By centering the prior around “most features produce small or near-zero lifts,” the Bayesian approach becomes conservative and less prone to overclaiming a positive effect (a minimal sketch of such a prior fit appears after these examples).
Hierarchical Models for Subpopulations: When tests span multiple geographies or product lines, hierarchical Bayesian methods “borrow strength” across subgroups. This exemplifies Type 2 (limiting the interval) in that each subgroup’s posterior is partially pulled toward an overall effect size, mitigating false discoveries from small-sample noise.
Structured Prior Elicitation: Some regulated settings, like clinical trials, demand rigorous prior specification before any data are seen. Likewise, a few large tech companies codify rules on how priors must be set and validated, reducing the scope for opportunistic adjustments.
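As promised above, here is a minimal sketch of fitting an empirical-Bayes prior from past experiments. The historical lifts and standard errors are hypothetical, each past estimate is assumed normal with known standard error, and the prior variance is backed out by a simple method-of-moments step (variance of the observed lifts minus the average sampling variance):

```python
import numpy as np

# Hypothetical lift estimates and standard errors from past experiments.
hist_lifts = np.array([0.002, -0.001, 0.004, 0.000, 0.015, -0.003, 0.001, 0.006])
hist_ses = np.full(hist_lifts.size, 0.004)

prior_mean = hist_lifts.mean()
prior_var = max(hist_lifts.var(ddof=1) - np.mean(hist_ses**2), 1e-8)
prior_sd = np.sqrt(prior_var)
print(f"empirical-Bayes prior: N({prior_mean:.4f}, {prior_sd:.4f}^2)")

# Shrink a new experiment's estimate toward that data-derived prior.
new_lift, new_se = 0.012, 0.006
w = (1 / prior_var) / (1 / prior_var + 1 / new_se**2)       # weight on the prior
print(f"shrunk estimate: {w * prior_mean + (1 - w) * new_lift:+.4f}")
```

Because the prior is estimated from data rather than asserted, it is harder for any individual to tilt it toward a preferred conclusion.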
Influential Stakeholders Dictating Priors (Type 1 Abuse): Suppose a VP pushing for the launch of a feature insists that a known “industry best practice” yields at least a 3% lift, leading to a prior that artificially skews the posterior. Pressure to confirm this expectation can mask marginal or null effects.
Excessive Caution or Slowed Experimentation (Type 2 Overuse): Overly tight priors or high Bayesian thresholds for calling a “win” can cause valid improvements to remain “inconclusive” for too long. Teams might lose momentum or revert to shipping features by “gut feel” rather than data.
Frequentist procedures like Benjamini–Hochberg explicitly adjust p-value thresholds to keep the proportion of false positives below a target. The mathematics of empirical Bayes often mirror FDR-based logic, suggesting that a well-implemented Type 2 Bayesian approach is functionally equivalent to an FDR-corrected frequentist pipeline. Hence, the real choice may hinge on interpretive preference, organizational familiarity, and available software tooling.
“Always-valid” or sequential p-values address the same challenge of early stopping without requiring a subjective prior. Teams hesitant to rely on uncertain “expert knowledge” may prefer these purely frequentist solutions for transparent inference.
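To make the Benjamini–Hochberg comparison concrete, here is a minimal sketch of the procedure on hypothetical p-values (in practice you would likely reach for an existing implementation, e.g. statsmodels’ multipletests with method="fdr_bh"):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejected hypotheses, controlling FDR at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m      # q * i / m for i = 1..m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()           # largest i with p_(i) <= q * i / m
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.012, 0.03, 0.04, 0.20, 0.45]
print(benjamini_hochberg(pvals, q=0.05))          # only the two smallest survive here
```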
“We are ignoring prior knowledge if we don’t shift the point estimate.” Legitimate knowledge should indeed inform analysis, but it must be empirically verified rather than anecdotally claimed or leadership-imposed. Implementing a robust empirical-Bayes pipeline using real historical data is often safer than manually injecting an optimistic mean.
Tom Cunningham has a thorough argument regarding calibrating observed effects from experiments to true effects. After examining multiple options, his advice was to report the effect with “benchmark statistics”, which we agree with.
“Tight Bayesian intervals are more scientifically rigorous, so they can’t hurt.” They can inadvertently hurt the culture of experimentation if each test now requires far more participants or time to achieve significance. Organizations must balance scientific caution with the need for actionable results. Excessively narrow priors or high Bayesian thresholds can yield a chilling effect, dissuading teams from testing.
“Surely we can spot extreme prior manipulation.” Major manipulations may be obvious, but small deviations accumulate. Teams might slightly adjust a prior multiple times, tipping marginal cases over the line. Without a transparent process, these shifts can remain undetected.
To mitigate risks in both types of informed Bayesian approaches:
Adopt Empirical-Bayes with Historical Data (When Feasible): Build priors from actual past tests—this curbs personal bias (Type 1 misuse) and provides an honest basis for controlling variance (Type 2).
Formalize Any Expert Priors Pre-Data: If domain experts truly have knowledge, document it thoroughly before seeing experiment results. Use independent reviews to sign off on distribution assumptions.
Calibrate and Simulate: For both approaches (shifting point estimates or tightening intervals), conduct simulations under realistic effect sizes to verify that your false positive and false negative rates remain acceptable (a minimal simulation sketch follows this list).
Maintain Transparency: Always disclose the prior parameters and how posterior results compare to frequentist metrics (e.g., p-values). This helps detect suspicious patterns or “Bayesian hacking.”
Consider Frequentist FDR or Always-Valid Methods as Complements: If you already have a robust pipeline for FDR-corrected p-values, adding a similar Bayesian version may be redundant. Conversely, teams uneasy about subjective priors might prefer purely frequentist sequential designs.
Safeguard Experiment Culture: Track how often prior-based analyses are invoked and how often they yield new decisions. If everything is “inconclusive” or suspiciously “positive,” the process may be stifling innovation or encouraging prior manipulation.
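The simulation sketch referenced under “Calibrate and Simulate” might look like the following (all parameters hypothetical): it estimates the false positive rate of the decision rule “ship if P(lift > 0) > 0.95” under different normal priors when the true lift is exactly zero, which makes the trade-offs of both informed approaches visible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def p_lift_positive(obs_lift, obs_se, prior_mean, prior_sd):
    """Posterior P(lift > 0) under a conjugate normal prior."""
    prior_prec, data_prec = 1 / prior_sd**2, 1 / obs_se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_mean * prior_prec + obs_lift * data_prec)
    return 1 - stats.norm.cdf(0, loc=post_mean, scale=np.sqrt(post_var))

def false_positive_rate(prior_mean, prior_sd, obs_se=0.01, n_sims=20_000):
    obs = rng.normal(0.0, obs_se, size=n_sims)              # true lift is exactly 0
    probs = p_lift_positive(obs, obs_se, prior_mean, prior_sd)
    return (probs > 0.95).mean()                            # how often we would "ship"

print("flat-ish prior       :", false_positive_rate(0.00, 1.0))    # roughly 5%, like one-sided NHST
print("optimistic prior     :", false_positive_rate(0.03, 0.02))   # inflated false positives
print("skeptical tight prior:", false_positive_rate(0.00, 0.005))  # near zero, but slower wins
```

Repeating the same exercise with a small but real true lift shows the flip side: the skeptical prior’s low false positive rate comes at the cost of many more samples to detect genuine improvements.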
If you are looking for the single approach that is most scientific, realistic, and robust, we recommend Tom’s approach of reporting the raw estimates along with benchmark statistics and idiosyncratic details, then letting the decision maker make the call.
Informed Bayesian methods offer two main levers in industrial A/B testing: (1) shifting the effect-size estimate via a prior, and (2) narrowing (or limiting) the credible interval. Each approach has a sound theoretical basis but carries distinct practical trade-offs.
Shifting the point estimate (Type 1) can be dangerous if an organization uses it to validate a favored outcome. A strong prior — especially one mandated by a senior stakeholder — may inflate false positives and mask weak or null results. Meanwhile, narrowing the interval (Type 2) aligns well with the spirit of controlling false discoveries but can mirror frequentist FDR adjustments in practice. It may also slow down experiments to the point that teams lose the appetite to test frequently, hindering the overall experimentation culture.
Ultimately, the question is not whether Bayesian methods are good or bad but how they are operationalized and governed. Empirical-Bayes pipelines, transparent reporting, simulations, and alignment of incentives can mitigate the downsides of both types of informed Bayesian approaches. By prioritizing these safeguards, data science teams can realize the benefits of Bayesian inference — clarity, flexibility, and a principled incorporation of prior knowledge — without sacrificing the speed or integrity of their testing culture.