Designing for failure

Sat Dec 18 2021

Vineeth Madhusudanan

Product Manager, Statsig

How Statsig stays up

Statsig serves billions of individual user interactions, and we designed the service so that the apps built on it stay reliable and available. If your application cannot reach Statsig for any reason, it continues to work as you expect using locally cached values. Read on for how we make this possible.

Server SDKs

How do the Server SDKs return results instantly? When you initialize a Statsig SDK on your server, the SDK reaches out to Statsig and retrieves definitions for your feature flags, experiments, and dynamic configs. Every subsequent feature gate, experiment, or dynamic config check is processed locally on your server. The response time for these checks is a fraction of a millisecond. Events uploaded to Statsig from the SDK are batched and will survive transient network connectivity issues.
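To make the idea of local evaluation concrete, here is a minimal sketch of an evaluator that checks gates against an in-memory rule set and batches exposure events for later upload. It is an illustration under stated assumptions, not the Statsig SDK: the Rule, RuleSet, and LocalEvaluator names, the simple equality rule format, and the batch size of 3 are all hypothetical.

```typescript
// Hypothetical shapes: these are NOT the Statsig SDK's real types.
type Rule = { attribute: string; equals: string; pass: boolean };
type RuleSet = Record<string, Rule[]>; // gate name -> ordered rules
type User = Record<string, string>;

class LocalEvaluator {
  private pendingEvents: string[] = [];

  constructor(private rules: RuleSet, private batchSize = 3) {}

  // Evaluate a gate purely in memory: no network call on the hot path.
  checkGate(user: User, gate: string): boolean {
    const rules = this.rules[gate] ?? [];
    let result = false; // unknown gates and unmatched users fail closed
    for (const rule of rules) {
      if (user[rule.attribute] === rule.equals) {
        result = rule.pass; // first matching rule wins
        break;
      }
    }
    this.pendingEvents.push(`${gate}:${result}`);
    return result;
  }

  // Hand back a full batch for upload, or null while the batch is still filling.
  takeBatch(): string[] | null {
    if (this.pendingEvents.length < this.batchSize) return null;
    return this.pendingEvents.splice(0, this.batchSize);
  }
}

const evaluator = new LocalEvaluator({
  new_checkout: [{ attribute: "country", equals: "NZ", pass: true }],
});
const nzResult = evaluator.checkGate({ country: "NZ" }, "new_checkout");
const usResult = evaluator.checkGate({ country: "US" }, "new_checkout");
```

The point of the design is that checkGate never waits on the network. Only the rule download at startup and the batched event upload touch it.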

What happens if there is no connectivity to Statsig? If your server loses connectivity to Statsig, it’ll happily continue serving results using the cached rule set it has. When connectivity to Statsig is restored, it’ll resume checking for updates to your project’s rule set.

What if I need to bootstrap a server without connectivity to Statsig? The Statsig SDKs allow you to save the rule sets that have been downloaded to your server and use them to bootstrap servers that come up without Internet connectivity or connectivity to Statsig. When connectivity resumes, the SDKs will refresh this rule set with any changes made since it was saved. The documentation covers the details: see bootstrapValues for how to retrieve this config and rulesUpdatedCallback for how to be notified on updates to it.
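The two behaviors described for servers, serving from the last known rule set while offline and bootstrapping from a saved one, can be sketched together. This is a simplified model rather than the SDK's implementation: RuleStore, Rules, and fetchRules are hypothetical names, and the constructor parameters only borrow the bootstrapValues and rulesUpdatedCallback names from the options mentioned above, without claiming to match their real signatures.

```typescript
// Hypothetical sketch: RuleStore, Rules, and fetchRules are illustrative names.
type Rules = { version: number; gates: Record<string, boolean> };

class RuleStore {
  private current: Rules | null = null;

  constructor(
    bootstrapValues: string | null, // rule set JSON saved earlier, if any
    private rulesUpdatedCallback: (json: string) => void, // called with each fresh rule set
  ) {
    if (bootstrapValues !== null) {
      this.current = JSON.parse(bootstrapValues) as Rules;
    }
  }

  // Try to refresh; on failure keep serving whatever rule set we already have.
  refresh(fetchRules: () => Rules): boolean {
    try {
      const fresh = fetchRules();
      this.current = fresh;
      this.rulesUpdatedCallback(JSON.stringify(fresh));
      return true;
    } catch {
      return false;
    }
  }

  checkGate(name: string): boolean {
    return this.current?.gates[name] ?? false;
  }
}

// A rule set saved by an earlier process, e.g. written to disk or a shared store.
let saved = JSON.stringify({ version: 1, gates: { new_checkout: true } });
const store = new RuleStore(saved, (json) => { saved = json; });

// The server comes up with no connectivity: the refresh fails, the bootstrap still serves.
const offlineRefresh = store.refresh(() => { throw new Error("no connectivity"); });
const whileOffline = store.checkGate("new_checkout");

// Connectivity returns: the rule set is replaced and the saved copy is updated.
const onlineRefresh = store.refresh(() => ({ version: 2, gates: { new_checkout: false } }));
const afterRefresh = store.checkGate("new_checkout");
```

Persisting the JSON handed to the callback is what lets the next server start from a recent rule set instead of an empty one.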

Watch this 3-minute video for more context!

Client SDKs

How do the Client SDKs return results instantly? When you initialize a Statsig client SDK, the SDK reaches out to Statsig, retrieves the precomputed values of all feature gates, experiments, and dynamic configs for the current user, and caches those values locally. Every subsequent feature gate, experiment, or dynamic config check looks up the value in memory. The response time for these checks is a fraction of a millisecond. Events uploaded to Statsig from the SDK are batched and will survive transient network connectivity issues, either through retries or by saving failed log event requests to local storage.
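The event path described above, batching with failed uploads saved to local storage and retried later, might look roughly like the following. This is a sketch under assumptions: EventLogger, KeyValueStore, the failed_log_events key, and the synchronous send function are all hypothetical, and an in-memory object stands in for browser localStorage so the example runs anywhere.

```typescript
// Hypothetical sketch: EventLogger, KeyValueStore, and Send are illustrative names.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

type Send = (batch: string[]) => boolean; // returns false when the upload fails

class EventLogger {
  private static readonly FAILED_KEY = "failed_log_events";
  private queue: string[] = [];

  constructor(private store: KeyValueStore, private send: Send) {}

  log(event: string): void {
    this.queue.push(event);
  }

  // Upload previously failed events plus the current queue as one batch.
  flush(): boolean {
    const savedJson = this.store.getItem(EventLogger.FAILED_KEY);
    const failed: string[] = savedJson ? JSON.parse(savedJson) : [];
    const batch = failed.concat(this.queue);
    this.queue = [];
    if (batch.length === 0) return true;
    const ok = this.send(batch);
    // Persist on failure so the events survive a restart; clear on success.
    this.store.setItem(EventLogger.FAILED_KEY, JSON.stringify(ok ? [] : batch));
    return ok;
  }
}

// In-memory stand-in for browser localStorage.
const memory: Record<string, string> = {};
const memoryStore: KeyValueStore = {
  getItem: (key) => memory[key] ?? null,
  setItem: (key, value) => { memory[key] = value; },
};

let networkUp = false;
const uploaded: string[] = [];
const logger = new EventLogger(memoryStore, (batch) => {
  if (!networkUp) return false;
  for (const event of batch) uploaded.push(event);
  return true;
});

logger.log("gate_exposure");
const firstFlush = logger.flush(); // network is down: the batch is saved, not lost
const savedAfterFailure = memory["failed_log_events"];
networkUp = true;
logger.log("button_click");
const secondFlush = logger.flush(); // the saved event is retried along with the new one
```

A real client would send asynchronously and cap how much it stores; both are left out here to keep the retry logic visible.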

What happens if there is no connectivity to Statsig? If your client loses connectivity to Statsig, it will fall back to using cached values. If this is a new user who has not had a chance to cache any values, all SDK APIs will return their default values: false for gates, and empty values for experiments and dynamic configs. Every experiment or dynamic config is also configured in your code with a default value that serves as a fallback.
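The fallback order described above (cached value first, then the default supplied in code) can be illustrated with a small lookup helper. The names below (CachedValues, checkGate, getConfigValue, checkout_button) are hypothetical and only model the behavior; they are not the SDK's API.

```typescript
// Hypothetical sketch: CachedValues and the lookup helpers are illustrative names.
type ConfigValue = Record<string, unknown>;
type CachedValues = {
  gates: Record<string, boolean>;
  configs: Record<string, ConfigValue>;
};

// A new user with no connectivity has nothing cached yet.
const EMPTY_CACHE: CachedValues = { gates: {}, configs: {} };

function checkGate(cache: CachedValues, name: string): boolean {
  return cache.gates[name] ?? false; // gates with no cached value default to false
}

function getConfigValue<T>(cache: CachedValues, config: string, key: string, fallback: T): T {
  const value = cache.configs[config]?.[key];
  // Use the in-code fallback when the value is missing or has a different type.
  return typeof value === typeof fallback ? (value as T) : fallback;
}

// New user, no cache: the gate is false and the config resolves to the in-code default.
const coldGate = checkGate(EMPTY_CACHE, "new_checkout");
const coldColor = getConfigValue<string>(EMPTY_CACHE, "checkout_button", "color", "blue");

// Returning user with cached values: the same calls resolve from the cache.
const warmCache: CachedValues = {
  gates: { new_checkout: true },
  configs: { checkout_button: { color: "green" } },
};
const warmGate = checkGate(warmCache, "new_checkout");
const warmColor = getConfigValue<string>(warmCache, "checkout_button", "color", "blue");
```

Because every call site names its own fallback, the application has a defined behavior even on a first launch with no network.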

Do we need a relay server? Some vendors provide an onsite relay or proxy to reduce load on their servers. A decade ago, outbound internet connectivity was a scarce resource at companies that weren’t digital first. Today a relay offers little value and is another potential point of failure to deploy, maintain, and monitor. We don’t think a relay server is worth it, but if there’s a problem or pain point you’re concerned about, we’d love to hear from you!

Server infrastructure

Statsig’s infrastructure spans AWS and Azure across multiple availability zones. Most data is returned from in-memory caches, allowing typical server response times well under 50ms. Because server and client SDKs cache values and evaluate locally, your application can continue to function without having to connect to the Statsig servers, except to initialize and then to lazily log events.

To deal with increased demand, we autoscale across our cloud providers. When an availability zone fails for any reason, we seamlessly fail over to surviving availability zones.

Every time we deploy code, we fail out an availability zone to upgrade it. Because failover is a core part of our deployment strategy, it is exercised regularly, which keeps it robust. Failovers that aren’t exercised frequently can become fragile; our approach ensures that ours are not.

Statsig is built by builders for builders. Have a question about reliability? Reach out and ask. We’re happy to engage!

Some links to learn more: Statsig’s availability dashboard, a 3-minute video on our client and server SDKs, and Statsig’s security posture.

