Date of slack thread: 6/5/24
Anonymous: Is statsig down? Not able to do config pulls
Jiakan Wang (Statsig): We noticed an issue in one of our regions (US-East) too and the team is investigating. Stay tuned.
Anonymous: Any updates?
Jiakan Wang (Statsig): The team is still actively working on mitigation. Please refer to our status page for updates; we’ll update on the latest incident when there is more info to share
Anonymous: Thank you for the updates! With that said, this is still a bit frustrating…
Sean Powers: Also would like an updated ETA on this if possible
Jiakan Wang (Statsig): Understand the frustration. For transparency, the process usually goes like this:
Regarding the plan moving forward, we understand that we have now had 2 incidents this week. Every time we have a SEV like this, we do a very thorough SEV review to identify gaps in detection and prevention so we can improve things for the future, and we will do the same here too.
Anonymous: We had a big marketing push on our end and most people going through the app are seeing a high error rate. This is kind of a big deal for us and we got absolutely obliterated today!
Jiakan Wang (Statsig): Sorry about that! It is not a typical issue this time around and the team is still working hard on a resolution. Once the dust settles, I’ll share a couple patterns we recommend our other customers use to be defensive against situations like this so that your app/site would not be affected.
Jiakan Wang (Statsig): It looks like things have recovered. Let me know if you are not seeing the same. (Just got a chance to sit down and write this, sorry about the delay.)
This incident only affected our client-side SDK’s /initialize endpoint, which allows your client application to fetch a user’s assignment for every feature gate and experiment at the beginning of the session.
The naive implementation of the SDK has 2 layers of safety mechanisms built in: cached values and default values. Essentially, if the SDK is not able to fetch the freshest values from our backend through the /initialize endpoint, the request times out and the SDK returns values cached in the device’s/browser’s local storage from a previous session. If the cache doesn’t exist (for new users), the SDK falls back to the default value, which is always false for gates, or, for experiment parameters, a hardcoded value that the SDK requires you to supply when calling the API, e.g. getExperiment('my_experiment').get('button_color', 'default_color').
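For readers following along, the fallback order described above (fresh values, then cached values, then defaults) can be sketched roughly as follows. This is only an illustration: `SessionValues`, `resolveValues`, `checkGate`, and `getParam` are hypothetical names, not the SDK's real internals, and the sketch assumes the fetch and cache lookups have already happened.

```typescript
// Illustrative sketch only: the types and function names are hypothetical,
// not the Statsig SDK's real internals.
type GateValues = Record<string, boolean>;
type ParamValues = Record<string, Record<string, string>>;

interface SessionValues {
  gates: GateValues;
  experiments: ParamValues;
}

// Pick the best available source: fresh values from /initialize if the
// request succeeded, otherwise values cached from a previous session.
// Both are null for a new user during an outage.
function resolveValues(
  fresh: SessionValues | null,
  cached: SessionValues | null
): SessionValues | null {
  return fresh ?? cached;
}

// Gates default to false when no value is known.
function checkGate(values: SessionValues | null, gate: string): boolean {
  return values?.gates[gate] ?? false;
}

// Experiment parameters fall back to the default the caller supplies.
function getParam(
  values: SessionValues | null,
  experiment: string,
  param: string,
  fallback: string
): string {
  return values?.experiments[experiment]?.[param] ?? fallback;
}
```

In this sketch a returning user during an outage gets last session's values, while a brand-new user gets `false` for every gate and the caller's default for every parameter.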
So even in the worst case, your clients should be “okay” thanks to these two features. Here are some more general tips.
If you are not okay with the client returning cached or default values in outages like this, then we recommend that you bootstrap your client SDK with a proxy server, which can run a Statsig server SDK to construct and serve the same response that our /initialize endpoint would serve to your client. This way your clients do not depend on our server endpoints at all, and your proxy server can effectively act as a Statsig server, which doesn’t need to be constantly talking to the Statsig backend to operate. Read more about this here.
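The bootstrap pattern described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: `GateRule`, `buildInitializePayload`, and `BootstrappedClient` are hypothetical names, and the percentage-based bucketing is a stand-in for whatever evaluation the real server SDK performs. The actual server-side and client-side bootstrap calls should be taken from the Statsig docs.

```typescript
// Illustrative sketch only. Every name here is hypothetical; check the
// Statsig docs for the real server SDK and client bootstrap APIs.
interface User {
  userID: string;
}

interface GateRule {
  name: string;
  passPercent: number; // 0 to 100
}

interface InitializePayload {
  gates: Record<string, boolean>;
}

// Deterministic 0-99 bucket from a string (FNV-1a style hash), so the same
// user always lands in the same bucket for a given gate.
function bucket(key: string): number {
  let h = 2166136261;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % 100;
}

// Proxy server side: evaluate every gate locally from rules held in memory,
// producing the payload the client would otherwise fetch over the network.
function buildInitializePayload(rules: GateRule[], user: User): InitializePayload {
  const gates: Record<string, boolean> = {};
  for (const rule of rules) {
    gates[rule.name] = bucket(`${rule.name}:${user.userID}`) < rule.passPercent;
  }
  return { gates };
}

// Client side: start from the payload the proxy served, with no request
// to the vendor's endpoint during startup.
class BootstrappedClient {
  private payload: InitializePayload;

  constructor(payload: InitializePayload) {
    this.payload = payload;
  }

  checkGate(name: string): boolean {
    return this.payload.gates[name] === true;
  }
}
```

The point of the design is that the client only ever talks to your own server at startup, so an outage of the vendor endpoint does not show up as client errors.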
Anonymous: This is awesome thank you!!
Anonymous: I feel like it wasn’t entirely clear to us what the initialization code was doing. Essentially, we can build a self-hosted cache that sits between Statsig and us? This way in case of an outage, we would depend on the last fetched value?
Jiakan Wang (Statsig): Essentially yes. The server SDK has a copy of your experiment/gate configurations, which allows it to independently evaluate any given user’s assignment for all of your entities locally in a very scalable way. If your client gets values served this way, then in the worst case, when Statsig is down, your server SDKs will still be running just fine, albeit using a slightly stale set of configs until Statsig is back up.
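The "slightly stale set of configs" behavior can be pictured with a small sketch like the one below. It is not the server SDK's implementation: `ConfigStore` and its methods are hypothetical, the refresh is synchronous for simplicity, and gates are reduced to plain booleans instead of per-user evaluation.

```typescript
// Illustrative sketch only; names are hypothetical, not the Statsig server SDK.
interface ConfigSnapshot {
  version: number;
  gates: Record<string, boolean>;
}

class ConfigStore {
  private snapshot: ConfigSnapshot;
  private stale = false;

  constructor(initial: ConfigSnapshot) {
    this.snapshot = initial;
  }

  // Try to pull newer configs. If the upstream call fails, keep serving the
  // last good snapshot instead of failing evaluations.
  refresh(fetchLatest: () => ConfigSnapshot): void {
    try {
      this.snapshot = fetchLatest();
      this.stale = false;
    } catch {
      this.stale = true;
    }
  }

  // Evaluation only reads the in-memory snapshot, so it works during outages.
  checkGate(name: string): boolean {
    return this.snapshot.gates[name] === true;
  }

  isStale(): boolean {
    return this.stale;
  }

  version(): number {
    return this.snapshot.version;
  }
}
```

Once the upstream recovers, the next successful refresh replaces the snapshot and clears the stale flag.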
Anonymous: This is a very acceptable compromise.