Your go-to guide for Online Bot Filtering

Tue Aug 06 2024

Michael Makris

Senior Data Scientist, Statsig

As a data scientist, my ideal data setup includes:

  • accurate, reliable data for your feature gate rollouts and experiments;

  • knowing how many users have seen your new home screen design, and its effect on sales;

  • knowing whether a new code deployment is increasing crash rates and hurting your business.

Digital bots and web crawlers can make all of this more difficult, and yet the modern internet and AI depend on them. This led us to roll out bot filtering by default on Statsig.

How bots can affect you

Bots can skew all sorts of results in experiments and analytics by inflating exposure counts and introducing noise. While investigating this issue we found that, depending on the company, bots can be responsible for anywhere from 0% to 50% of their raw exposures. In the worst cases we found that up to 80% of a company’s unique units (e.g. Users, Sessions, Devices) in feature rollouts and experiments could be bots, although most companies see fewer.

Even an experiment with high bot participation should still report metrics’ relative deltas (ΔX̄%) correctly, since bots are split across variants proportionally like everyone else. If bots bias a metric, we expect the bias to be equally present in treatment and control; the metric’s absolute values and confidence interval, however, will change as a result.

Imagine your +8% ± 10.0% improvement in revenue per visitor was really +8% ± 5%. That could be the difference between a non-statsig and a statsig result! What changed was the number of users who contribute nothing to revenue but add noise (i.e., variance) to your results.

Crawling bots tend to add users to your experiments that don’t contribute to your success metrics. They don’t buy goods, download software, or sign up for classes. This can dilute your metrics in all variant groups and degrade your experimental power.
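To make the dilution effect concrete, here is a small stdlib-Python sketch. It uses the same toy assumptions as the example in the next section (humans buy 50% of the time at roughly N($10, $5), bots never buy); the function names are illustrative, not anything from Statsig's codebase:

```python
import math
import random
import statistics

random.seed(0)

def human_revenue():
    # A human visitor buys 50% of the time at a price ~ N($10, $5), floored at $0.
    return max(random.gauss(10.0, 5.0), 0.0) if random.random() < 0.5 else 0.0

humans = [human_revenue() for _ in range(10_000)]
mixed = humans + [0.0] * 10_000  # add 50% bot traffic; bots never purchase

def rel_stderr(xs):
    # Standard error of the mean, expressed relative to the mean itself.
    return statistics.stdev(xs) / math.sqrt(len(xs)) / statistics.fmean(xs)

print(f"humans only: {rel_stderr(humans):.2%}")
print(f"with bots:   {rel_stderr(mixed):.2%}")  # larger relative noise
```

Even though adding bots doubles the sample size, the relative noise on the mean goes up, because every bot drags the mean down while contributing nothing to the signal.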

Example: Home Screen Revenue Experiment

Let’s walk through a simple example in detail to understand how bots affect experiment results.

Your company runs a home screen experiment and you want to measure average revenue per visitor. We’ll assume we know the ground truth real outcomes, and then see how our ability to detect improvements degrades when we run a simulation with and without bots.

  • Let’s say that real human visitors to our website make purchases 50% of the time while bots never purchase anything. When visitors do purchase something the price is normally distributed with a mean of $10 and a standard deviation of $5, with a minimum floor of $0.

  • We want to test a new homepage variant and now the average purchase price increases 5% to $10.50, and the standard deviation remains $5.

These example conditions make it easy to simulate what happens with and without bots. When visitors and purchase prices perfectly follow the above rules and distributions, our simulations produce the following results:

                With 50% of Users as Bots     Bots Removed
Control         Total Visits = 20k            Total Visits = 10k
                Total Revenue = $50,211       Total Revenue = $50,211
                Metric Mean = 2.51            Metric Mean = 5.02
                Metric Variance = 24.91       Metric Variance = 37.22
Treatment       Total Visits = 20k            Total Visits = 10k
                Total Revenue = $52,664       Total Revenue = $52,664
                Metric Mean = 2.63            Metric Mean = 5.26
                Metric Variance = 26.85       Metric Variance = 39.84
Conclusion      Metric: +4.89% ± 5.62%        Metric: +4.89% ± 3.43%
                p-value: 0.0883               p-value: 0.0052

Removing bots from experiments didn’t meaningfully change the deltas. It did, however, make real improvements to the confidence intervals, helping our example experiment go from non-statsig to statsig.
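For the curious, a simulation like this can be sketched with the Python stdlib alone. The distribution parameters come from the bullets above; the normal-approximation z-test is a stand-in for whatever stats engine you actually use, and exact numbers will vary with the seed:

```python
import math
import random
import statistics

random.seed(7)

def visitor_revenue(mean_price, is_bot):
    # Bots never buy; humans buy 50% of the time at N(mean_price, $5), floored at $0.
    if is_bot or random.random() >= 0.5:
        return 0.0
    return max(random.gauss(mean_price, 5.0), 0.0)

def simulate(n_visitors, mean_price, bot_fraction):
    return [visitor_revenue(mean_price, random.random() < bot_fraction)
            for _ in range(n_visitors)]

def compare(control, treatment):
    """Relative delta and two-sided p-value from a z-test on the difference
    in means (a normal approximation, reasonable at these sample sizes)."""
    se = math.hypot(statistics.stdev(control) / math.sqrt(len(control)),
                    statistics.stdev(treatment) / math.sqrt(len(treatment)))
    delta = statistics.fmean(treatment) - statistics.fmean(control)
    p = math.erfc(abs(delta) / se / math.sqrt(2))
    return delta / statistics.fmean(control), p

with_bots = compare(simulate(20_000, 10.0, 0.5), simulate(20_000, 10.5, 0.5))
bots_removed = compare(simulate(10_000, 10.0, 0.0), simulate(10_000, 10.5, 0.0))
print("with bots:    delta=%+.2f%%  p=%.4f" % (with_bots[0] * 100, with_bots[1]))
print("bots removed: delta=%+.2f%%  p=%.4f" % (bots_removed[0] * 100, bots_removed[1]))
```

The relative deltas land in the same neighborhood either way; what the bots change is the noise around them.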

The meaningful results in this example reflect what Statsig sees in general with bot filtering: the more bots and real users differ in behavior (e.g., bots never buy anything but your customers do), the more bots can affect the results you see. You don’t have to worry about bot filtering changing your core metric results: metric deltas will stay the same. Removing bots can, however, deliver real gains in experimental power and sensitivity, helping you move faster and make decisions with less data.

Example: Simulating More Experiments with Noise

Our prior example assumed perfect statistical trends: the human users’ behavior exactly matched the specified means and standard deviations. What happens if we run more realistic simulations where purchases are drawn randomly from those distributions, noise and all? We ran 10k simulations to find out.

We computed the width of the confidence interval and magnitude of p-value over 10k simulations, with and without bots:

[Figure: confidence-interval widths and p-values across the 10k simulations, with and without bots]

Here we see real improvements between the with-bots and no-bots simulations. When bots behave very differently from real users, the gains are substantial: the median confidence-interval width decreased by 13.7%, and the median p-value decreased by over 66%. All told, the share of simulations that found a statsig result rose from 71% to 81%.
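A scaled-down version of such a repeated-simulation study can be sketched with the stdlib alone. This uses 100 replications at much smaller sample sizes purely for speed, so the exact numbers won’t match the full 10k-run study above:

```python
import math
import random
import statistics

random.seed(1)

def human_revenue(mean_price):
    # A human buys 50% of the time at N(mean_price, $5), floored at $0.
    return max(random.gauss(mean_price, 5.0), 0.0) if random.random() < 0.5 else 0.0

def p_value(control, treatment):
    # Two-sided z-test on the difference in means (normal approximation).
    se = math.hypot(statistics.stdev(control) / math.sqrt(len(control)),
                    statistics.stdev(treatment) / math.sqrt(len(treatment)))
    z = (statistics.fmean(treatment) - statistics.fmean(control)) / se
    return math.erfc(abs(z) / math.sqrt(2))

def one_experiment(n_humans, n_bots):
    control = [human_revenue(10.0) for _ in range(n_humans)] + [0.0] * n_bots
    treatment = [human_revenue(10.5) for _ in range(n_humans)] + [0.0] * n_bots
    return p_value(control, treatment)

with_bots = [one_experiment(2_000, 2_000) for _ in range(100)]
bots_removed = [one_experiment(2_000, 0) for _ in range(100)]
print("median p, with bots:    %.3f" % statistics.median(with_bots))
print("median p, bots removed: %.3f" % statistics.median(bots_removed))
```

Comparing the two distributions of p-values (their medians, or the share below 0.05) is exactly the kind of summary the study above reports.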

Who sees more bots

The biggest predictor of bot traffic tends to be the environment that generates it. The bot traffic that reaches Statsig depends on the SDKs our customers use: some SDKs run client-side on websites that bots are free to crawl, while others run server-side.

One assumption we had going in was that bots would be found almost exclusively in client SDK traffic, given its accessibility to them; the data showed, however, that a large amount of bot traffic is also processed through servers without being filtered out first.

[Figure: bot exposures by SDK type]

When we dug further into bot traffic by SDK type, we saw that the source company made a large difference. When companies use server SDKs to process web sessions, we generally see more bot traffic, since bots can freely access web pages. When companies use server SDKs behind a login wall, we see far fewer bots, given the natural barrier this poses. Unsurprisingly, client SDKs see the highest bot traffic; even here, though, some companies have done more work than others to avoid logging bots.

Given how many variables determine whether your traffic is accessible to bots, our bot filtering is applied to clean up analytics for all Statsig SDKs, regardless of source.

How we filter bots

When we looked at the user-agent strings Statsig customers logged through our SDKs, we found that many of the biggest bots out there were already self-reporting in their browser_name. “Googlebot”, “FacebookBot”, and “TwitterBot” were all naming themselves, along with more than 300 others.

We also tested a major industry package that uses IP address to identify bots, but we found that it missed the vast majority of these bots that were already naming themselves. We decided that for our initial launch of bot filtering, the more direct approach was better.

We decided to filter out bot traffic based on the browser_name of exposure events. This simple change compares browser names against an indexed list of self-identifying bot names. By excluding these bots from our data pipelines, we can provide cleaner, more accurate data for analysis. We will maintain this list on an ongoing basis, so new bots (or bots we missed) don’t start polluting your results.
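Conceptually, the filter is just a set-membership check on browser_name. Here is an illustrative sketch; the bot names shown are a tiny excerpt, and the field and function names are assumptions rather than Statsig’s actual pipeline code:

```python
# A tiny excerpt for illustration; the real list covers 300+ self-reporting bots.
BOT_BROWSER_NAMES = {"googlebot", "facebookbot", "twitterbot", "bingbot"}

def is_bot_exposure(exposure: dict) -> bool:
    # Case-insensitive match of the exposure's browser_name against the bot list.
    browser = (exposure.get("browser_name") or "").lower()
    return browser in BOT_BROWSER_NAMES

exposures = [
    {"user_id": "u1", "browser_name": "Chrome"},
    {"user_id": "crawler", "browser_name": "Googlebot"},
]
clean = [e for e in exposures if not is_bot_exposure(e)]
print([e["user_id"] for e in clean])  # → ['u1']
```

An indexed lookup like this is cheap enough to run at every stage of a data pipeline, which is what makes filtering early practical.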

We are implementing this filtering at multiple stages in our data pipelines, ensuring that bot traffic is removed as early as possible. This not only improves data accuracy but also reduces storage and compute costs, which we can pass along to our customers.

How customers use this feature

Customers can benefit from this feature without any additional effort. Bot filtering will be applied automatically to all exposures, improving the accuracy of rollouts and their metrics. For those who prefer not to use this feature, Statsig offers an opt-out option available in the console under your project settings.

Controlling which features bots see

The most common response we’ve gotten when sharing this feature with customers has been “Take my money!” The second most common has been a request for control over which features and experiment variants bots receive. For example, you might be rolling out a new look for your home page, but you don’t want search engines to index it yet in case you roll it back.

Thankfully, this is totally doable on Statsig. Using Segments, you can define a rule that identifies bots by their browser names (just like we do). This global Segment can then be applied to your features and experiments to control exposures. You can find detailed steps in our docs.

Billing changes

Statsig believes in being transparent and fair with our customers. As part of this change, any bot events dropped will not contribute to your billable events. This means that customers will not be charged for bot-generated exposures, leading to potential cost savings depending on their bot traffic.
