Frequently Asked Questions

A curated summary of the top questions asked on our Slack community, often relating to implementation, functionality, and building better products generally.

GENERAL

How can we ensure reliability and mitigate issues with Statsig's service?

Date of Slack thread: 6/5/24

Anonymous: Is statsig down? Not able to do config pulls

Jiakan Wang (Statsig): We noticed an issue in one of our regions (US-East) too, and the team is investigating. Stay tuned.

Anonymous: Any updates?

Jiakan Wang (Statsig): The team is still actively working on mitigation. Please refer to our status page for updates; we’ll update on the latest incident when there is more info to share

Anonymous: Thank you for the updates! With that said, this is still a bit frustrating…

It looks like the incident report was updated late and backdated (published ~15 mins after the incident started).
What is the plan moving forward to prevent these incidents? We are currently relying on Statsig as a core component and may have to move away from it given its recent reliability. Is there an update here? We’re still seeing a ton of issues on our end.

Sean Powers: Also would like an updated ETA on this if possible

Jiakan Wang (Statsig): Understand the frustration. For transparency, the process usually goes like this:

1. An incident happens.
2. Depending on the severity, it may trigger an alert right away, or take some time to ramp up before alerts trigger.
3. The infra on-call receives the alert and takes some steps to confirm whether it’s an actual outage or a false alarm.
4. Once confirmed, we create a SEV internally to track and discuss investigation/mitigations.
   - During this stage, someone will update the status page to reflect the incident, but it takes at least a few minutes to get to this stage. And sometimes, because the team is focused on the investigation, there is a bit more lag here. We are trying to get into the habit of updating the status page more diligently though.

Regarding the plan moving forward, we understand that we have now had 2 incidents this week. Every time we have a SEV like this, we do a very thorough SEV review to identify gaps in detection and prevention so we can improve things for the future, and we will do the same here too.

Anonymous: We had a big marketing push on our end and most people going through the app are seeing a high error rate. This is kind of a big deal for us and we got absolutely obliterated today!

Jiakan Wang (Statsig): Sorry about that! This is not a typical issue, and the team is still working hard on a resolution. Once the dust settles, I’ll share a couple of patterns we recommend our other customers use to be defensive against situations like this so that your app/site would not be affected.

Jiakan Wang (Statsig): It looks like things have recovered. Let me know if you are not seeing the same; (Just got a chance to sit down and write this, sorry about the delay)

This incident only affected our client-side SDK’s /initialize endpoint, which allows your client application to fetch a user’s assignment for every feature gate and experiment at the beginning of the session.

Even a naive implementation of the SDK has two safety layers built in: cached values and default values. If the SDK cannot fetch the freshest values from our backend through the /initialize endpoint, the request times out and the SDK returns the values cached in the device’s or browser’s local storage from a previous session. If the cache doesn’t exist (for new users), the SDK falls back to the default value, which is always false for gates, or, for experiment parameters, a hardcoded value that the SDK requires you to supply when using the API, e.g. getExperiment('my_experiment').get('button_color', 'default_color').
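
The lookup order described above (fresh values, then cached values, then the default) can be sketched in a few lines of TypeScript. This is only an illustration of the idea, not the SDK's internal code: the names `resolveParam` and `resolveGate` and the `Values` type are hypothetical, and a timed-out /initialize request is modeled simply as `null`.

```typescript
// Hypothetical helpers for illustration; not part of the Statsig SDK.
type Values = Record<string, unknown>;

// fresh:  values from /initialize, or null if the request timed out
// cached: values from a previous session, or null for a new user
function resolveParam<T>(
  key: string,
  defaultValue: T,
  fresh: Values | null,
  cached: Values | null,
): T {
  // Prefer fresh values; otherwise use the cache from a previous session.
  const source = fresh ?? cached;
  if (source !== null && key in source) {
    return source[key] as T;
  }
  // Neither layer has the key: return the caller-supplied default.
  return defaultValue;
}

// Gates follow the same order, with a fixed default of false.
function resolveGate(
  name: string,
  fresh: Values | null,
  cached: Values | null,
): boolean {
  return resolveParam<boolean>(name, false, fresh, cached);
}

// Outage with a returning user: the cached value is served.
const returningUser = resolveParam("button_color", "default_color", null, {
  button_color: "blue",
});

// Outage with a new user: nothing is cached, so the default is served.
const newUser = resolveParam("button_color", "default_color", null, null);
```

In this sketch, `returningUser` comes from the cache and `newUser` is the hardcoded default, which is why the default passed at each call site should be a value the product can safely run with.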

So even in the worst case, your clients should be “okay” thanks to these two safeguards. Here are some more general tips.

If you are not okay with the client returning cached or default values in outages like this, we recommend that you bootstrap your client SDK from a proxy server, which runs a Statsig server SDK to construct and serve the same response that our /initialize endpoint would serve to your client. This way your clients do not depend on our server endpoints at all, and your proxy server effectively acts as a Statsig server, one that doesn’t need to be constantly talking to the Statsig backend to operate. Read more about this here.
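
As a rough sketch of the bootstrapping pattern described above: the proxy evaluates every gate for the user locally and hands the result to the client, which then reads from that payload instead of calling a Statsig endpoint. All names here (`User`, `GateRule`, `buildBootstrapPayload`, `BootstrappedClient`) and the payload shape are assumptions made for illustration; they are not the Statsig SDK API or the real /initialize response format.

```typescript
// Hypothetical types and names; not the Statsig SDK API or response schema.
interface User {
  userID: string;
  country?: string;
}

type GateRule = (user: User) => boolean;

// Runs on your proxy server: evaluates every gate locally for one user.
function buildBootstrapPayload(
  rules: Record<string, GateRule>,
  user: User,
): Record<string, boolean> {
  const payload: Record<string, boolean> = {};
  for (const [gate, rule] of Object.entries(rules)) {
    payload[gate] = rule(user);
  }
  return payload;
}

// Runs on the client: reads gates from the bootstrapped payload only, so no
// request to a Statsig endpoint is needed at session start.
class BootstrappedClient {
  private readonly values: Record<string, boolean>;

  constructor(values: Record<string, boolean>) {
    this.values = values;
  }

  checkGate(name: string): boolean {
    // Unknown gates default to false, matching the gate default in the thread.
    return this.values[name] ?? false;
  }
}

// Example rules held by the proxy (made up for this sketch).
const rules: Record<string, GateRule> = {
  new_checkout: (user) => user.country === "US",
  dark_mode: () => true,
};

const client = new BootstrappedClient(
  buildBootstrapPayload(rules, { userID: "u1", country: "US" }),
);
```

In a real setup the payload would travel from the proxy to the client over your own HTTP endpoint or be embedded in the page; the sketch keeps both halves in one process to stay self-contained.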

Anonymous: This is awesome thank you!!

Anonymous: I feel like it wasn’t entirely clear to us what the initialization code was doing. Essentially, we can build a self-hosted cache that sits between Statsig and us? This way in case of an outage, we would depend on the last fetched value?

Jiakan Wang (Statsig): Essentially yes. The server SDK keeps a copy of your experiment/gate configurations, which allows it to independently evaluate any given user’s assignment for all of your entities locally in a very scalable way. If your client gets its values served this way, then in the worst case, when Statsig is down, your server SDKs will still be running just fine, albeit using a slightly stale set of configs until Statsig is back up.
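
The "slightly stale set of configs" behavior can be pictured as a store that keeps its last successfully downloaded copy whenever a refresh fails. The sketch below is a simplified model under that assumption, not the server SDK's actual implementation; `ConfigStore`, `sync`, and `isStale` are hypothetical names, and `download` stands in for whatever call fetches configs from Statsig.

```typescript
// Hypothetical names; a sketch of "keep the last good copy", not SDK internals.
type Configs = Record<string, boolean>;

class ConfigStore {
  private configs: Configs = {};
  private stale = false;

  // `download` stands in for fetching configs from Statsig; it throws when
  // the backend is unreachable.
  sync(download: () => Configs): void {
    try {
      this.configs = download();
      this.stale = false;
    } catch {
      // Keep serving the previous copy instead of failing.
      this.stale = true;
    }
  }

  // Evaluation only touches the local copy, so it works during an outage.
  checkGate(name: string): boolean {
    return this.configs[name] ?? false;
  }

  isStale(): boolean {
    return this.stale;
  }
}

const store = new ConfigStore();

// First sync succeeds and populates the local copy.
store.sync(() => ({ new_checkout: true }));

// Second sync fails (simulated outage); the earlier copy is retained.
store.sync(() => {
  throw new Error("Statsig unreachable");
});
```

After the failed sync, `store.checkGate("new_checkout")` still answers from the earlier copy and `store.isStale()` reports that the data may be out of date, which is the trade-off accepted in the thread.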

Anonymous: This is a very acceptable compromise.
