Date of slack thread: 7/15/24
Anonymous: We noticed an issue where dynamic config is silently returning an empty response on the Python API. This happened multiple times in the last week, each time lasting around 10 minutes before recovering. Is this outage logged anywhere?
Anonymous: The last time this happened was around 1:10 PM today.
Jiakan Wang (Statsig): Hi <@U07CX891KJ5> - yes we had a redis outage today from 12:58pm to 1:08pm, lasted exactly 10 mins. I don’t recall similar incidents from last week though, do you have specific time range that I can take a look?
Jiakan Wang (Statsig): This is our <https://status.statsig.com/|status page> where we post about outages. We had attempted to post the update about today’s outage there but it looks like the mutation to publish had an outage itself so it failed, we are looking at ways to update the status page to reflect today’s incident right now.
Anonymous: yes, why is this not showing up on your status page? this is a serious breach of SLA. Another case happened Jul 10 17:10 PDT.
Jiakan Wang (Statsig): As I mentioned above, there was a failure on updating the status page. We just found the issue and it’s now on the incident section. I’m not aware of any outage on 7/10 around that time. Do you have more details on it suggesting that it was an extended outage vs. a transient failure?
Anonymous: we saw the exact same behavior as today, and the only explanation for that is statsig returning an empty response. Is there anything you are doing to prevent this from happening again? as a service for config control, this will cause a different code path to be taken and can cause serious problems.
Jiakan Wang (Statsig): "Is there anything you are doing to prevent this from happening again?" For sure. For your context, today’s incident is treated as a SEV1 internally, which is the highest severity that we have. The team just finished mitigation and investigation for it, and a series of follow-ups will be reviewed and implemented to prevent this. I don’t see any reported outage from last week. Let me check some logs.
Anonymous: thank you. in comparison, my estimate (the number of logs around this is 7K this week vs 1.3K last week) is that the outage today is around 6X longer than last week's. maybe you need to look for a smaller outage there. The reason I am confident this is the issue: we are using one config as a killswitch and the config was never changed for months, but we saw the killswitch in effect today and last week for a few minutes each, and it recovered by itself without any intervention. The only explanation is that statsig is returning {} and causing us to take the default.
Jiakan Wang (Statsig): Ah this is my bad, we actually did have a small outage (SEV 3) at your mentioned time last week, which only affected the US East region. I was basing off what the status page showed as incidents, but it looks like incident reporting functionality was broken since last week, which caused the last 2 incidents to not be reflected there. The most recent one we just did a manual overwrite to make it show up.
Jiakan Wang (Statsig): I’ll make sure we take status page updating as part of the SEV review process so that we make sure status reporting is done properly and issues with it are noticed.
Anonymous: wait, what was the SEV3? is this same cause?
Jiakan Wang (Statsig): FWIW, what we had today is likely the largest outage we’ve had. We are definitely taking it as serious as we can to make sure similar issues don’t occur again.
Jiakan Wang (Statsig): No, completely unrelated.
Jiakan Wang (Statsig): That was what caused the 7/10 spike.
Anonymous: I don’t know which one worries me more.
Jiakan Wang (Statsig): We do in general have very high reliability, and we continue to improve as incidents like this happen. Do note that even with today’s outage, the uptime for the server endpoint is still at 99.9% for the past 30 days. That aside, incidents will always happen, and it’s important that you code defensively against them. There are a couple of mechanisms that our customers use to ensure that their apps can operate just fine even when we are having an incident like this. What SDKs are you using? Check out this <https://docs.statsig.com/guides/uptime|guide>, specifically client SDK bootstrapping mentioned in #7 if you are using our client SDK directly on your client, and data adapter in #8 if you are using our server SDKs.
Jiakan Wang (Statsig): These 2 will make sure that your clients and servers are resilient to a Statsig outage, and can operate using cached values until Statsig comes back online, even for newly spawned servers and clients.
Anonymous: we do have a cache, but the problem lasted long enough for the cache to expire. We are going to add some placeholder value in the config to distinguish empty vs outage, but this is still a problem. Downtime/incidents happen, but failing silently is what concerns me.
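A minimal sketch of the placeholder idea mentioned above, assuming the dynamic config has already been read into a plain dict. The `_present` marker, the `kill` field, and the `resolve_killswitch` helper are hypothetical names for illustration and are not part of the Statsig SDK.

```python
from typing import Optional

SENTINEL_KEY = "_present"  # hypothetical marker field added to the real config


def resolve_killswitch(config: dict, last_known: Optional[bool]) -> bool:
    """Return the killswitch state, treating a config without the marker as 'no data'."""
    if config.get(SENTINEL_KEY) is True:
        # A real config was delivered, so its value can be trusted.
        return bool(config.get("kill", False))
    # Empty or partial response: keep the last known good value if there is one,
    # otherwise fall back to "off".
    return last_known if last_known is not None else False
```

With this shape, an empty `{}` response no longer flips the killswitch on its own: `resolve_killswitch({}, last_known=False)` stays `False`, while a delivered config with the marker set is applied as-is.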
Jiakan Wang (Statsig): Please take a look at what I suggested above - it’s not a simple time based cache for your code, but rather internal to the SDK that will keep the SDK functional and return the last known good value until it receives updates. This will not expire in situations like this.
Jiakan Wang (Statsig): Most of our customers who’ve implemented this did not have noticeable impact during the outage today due to safeguards like this in place.
Anonymous: sounds good.
Jiakan Wang (Statsig): btw the SDK should not go from returning the correct value to returning an empty value if it’s already up and running at the time of the outage. Only newly initialized SDKs that have not yet had a correct value would fetch an empty value and then return that when checked, because they just don’t know anything else. The suggested bootstrapping method handles this by bootstrapping all newly initialized SDK instances with known good values from a centralized data store.
Anonymous: ok, we are running this in aws lambda, which may explain the missing data. bootstrapping maybe a good idea, we will look into this.
Jiakan Wang (Statsig): I see, so you are using a server SDK in the lambda? Initializing server SDKs in lambda is going to be slow, so bootstrapping using a JSON, which is what the SDK gets from our backend, is going to be good for both reliability and performance. Do you know if they have something like the <https://docs.statsig.com/integrations/cloudflare|cloudflare KV store> where you can store the JSON from statsig?
Anonymous: thank you. i think we can cache the value at CI time. do we have a python api to set the initial value? I noticed get_client_initialize_response get a structure that is not what bootstrap_values is expecting in statsig.initialize.
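```
import json

# bootstrap_values is the response being cached (the thread suggests it came
# from get_client_initialize_response)
with open("bootstrap_values.json", "w+") as f:
    f.write(json.dumps(bootstrap_values))
```
```
from statsig import statsig, StatsigOptions

bootstrap_values = None
try:
    with open("bootstrap_values.json", "r") as f:
        bootstrap_values = f.read()
except OSError:
    print("Bootstrap values do not exist")

# sdk_key and tier are defined elsewhere in the application
options = StatsigOptions(tier=tier, bootstrap_values=bootstrap_values)
statsig.initialize(sdk_key, options)
```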
**Anonymous:** for example in the json returned, $.feature_gates is a dict but the initialize function is expecting a list.
**Jiakan Wang (Statsig):** get_client_initialize_response is producing a JSON to bootstrap a CLIENT SDK, e.g. JS or iOS SDK, so that your client SDKs don't need to talk to Statsig backend directly. This is not meant to bootstrap server SDKs. For server SDK bootstrapping, look at the data adapter part.
**Jiakan Wang (Statsig):** <https://docs.statsig.com/server/concepts/data_store>
**Anonymous:** i don't see any document about how to get the initial value to use for the data store.
**Jiakan Wang (Statsig):** Ah that’s right, the documentation here is a bit lacking. By default, the SDK periodically fetches the value from the Statsig backend and writes it into the data adapter to keep it updated. During initialization, an SDK will try to initialize from the value in the data adapter if it’s available.
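The flow described above (the SDK reads from the adapter at startup and writes fresh values back on each sync) can be sketched with a file-backed store. This is only an illustration: the `FileDataStore` class and its file path are hypothetical, and the `get`/`set`/`shutdown` method names are an assumption about the adapter interface that should be checked against the data store documentation linked above.

```python
import json
import os
import tempfile
from typing import Optional


class FileDataStore:
    """Hypothetical key/value store that persists everything to one JSON file."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _load(self) -> dict:
        try:
            with open(self._path, "r", encoding="utf-8") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return {}  # no usable cache yet

    def get(self, key: str) -> Optional[str]:
        return self._load().get(key)

    def set(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        # Write to a temp file, then rename, so readers never see a partial file.
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp, self._path)

    def shutdown(self) -> None:
        pass  # nothing to release for a plain file
```

A store like this only helps if something keeps writing to it. A file baked in at CI time is as stale as the last build, which is the trade-off discussed later in this thread.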
**Jiakan Wang (Statsig):** You can also make a request to our endpoint directly to see what the value looks like. This is the endpoint our SDK makes to keep its internal value fresh.
```
curl --location --request POST 'https://api.statsig.com/v1/download_config_specs' \
--header 'STATSIG-API-KEY: <YOUR-SERVER-SECRET>' \
--header 'sinceTime: 0'
```
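For a CI step, the same request can be made from Python using only the standard library. This is a sketch that mirrors the curl command above; the `build_specs_request` and `save_specs` helpers are hypothetical names, and whether the saved body can be passed directly as `bootstrap_values` is not confirmed in this thread.

```python
import urllib.request

SPECS_URL = "https://api.statsig.com/v1/download_config_specs"


def build_specs_request(server_secret: str) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    return urllib.request.Request(
        SPECS_URL,
        method="POST",
        headers={"STATSIG-API-KEY": server_secret, "sinceTime": "0"},
    )


def save_specs(server_secret: str, out_path: str, timeout: float = 10.0) -> None:
    """Fetch the config specs and write the raw JSON body to out_path."""
    request = build_specs_request(server_secret)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(body)
```

Keeping the request construction separate from the network call makes the headers easy to check without contacting the endpoint.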
**Jiakan Wang (Statsig):** Another option we have is the <https://docs.statsig.com/server/concepts/forward_proxy|Forward Proxy>, which can be used by itself, or with data adapter to provide 2 layers of safeguards. This feature is currently in beta.
**Anonymous:** thanks.
**Anonymous:** ok, one last issue: is there a way to trigger a network update immediately after the datastore based bootstrap is loaded. This way we can get the latest update asap while having the cached value as a good fallback? Right now if i use the datastore, I need to wait for a sync (10s by default) before the latest config from network is loaded.
**Jiakan Wang (Statsig):** Reading from the code, I think it does immediately try to fetch an updated value from the network right after loading from the store. I may have read it wrong though. Have you tested and seen it waiting for 10s before doing that?
**Anonymous:** yes i can confirm i have to wait more than 10 sec for it to load. waiting for 1, 3, or 7 seconds doesn't get the latest data. i ended up initializing the sdk with no cache data first, checking whether it is working by reading some remote value, and if not, re-initializing the sdk with cached data. this is the only way i got it to work the correct way. you guys need to improve your sdk and documentation.
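The workaround described above (network-only init first, then re-init from cache if the result looks empty) can be sketched independently of the SDK. All names here are hypothetical: `init`, `looks_healthy`, and `shutdown` are callables supplied by the caller, so nothing in this block is the SDK's own API.

```python
from typing import Callable, Optional


def initialize_with_fallback(
    init: Callable[[Optional[str]], None],
    looks_healthy: Callable[[], bool],
    shutdown: Callable[[], None],
    cached_values: Optional[str],
) -> str:
    """Try a network-only init first; re-initialize from cache if it looks empty.

    Returns "network" or "cache" to show which source ended up in use.
    """
    init(None)  # no bootstrap data, so values can only come from the network
    if looks_healthy():
        return "network"
    if cached_values is None:
        return "network"  # nothing to fall back to; keep the SDK as initialized
    shutdown()
    init(cached_values)  # second attempt, seeded with the cached values
    return "cache"
```

In practice `init` might wrap `statsig.initialize(sdk_key, StatsigOptions(bootstrap_values=values))` and `looks_healthy` might check a known remote value, as the message above describes. That wiring is an assumption and is left to the caller.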
**Jiakan Wang (Statsig):** Wait I don’t think you are using it the way it’s intended for. The data store is supposed to always have the latest value if your setup involves multiple instances of the SDKs connecting to it, each one is updating the value in it so that a new one will get fresh value at startup - no need to wait for network value in practice. The forward proxy option will keep the value refreshed itself by periodically calling our endpoint directly.
**Jiakan Wang (Statsig):** if you are testing locally with just one instance, you aren’t gonna have ready to use value that’s fresh.
**Anonymous:** i tested by saving the cache, changing the value (of config A) on the platform, initializing the sdk, waiting X seconds, then getting the value of config A. with X < 10, the value i get is the cached value from before my change; with X > 10, i get the updated value.
**Anonymous:** if you look at the sdk code, the first fetch from the network (if initialized from the data store, and the data store has a cached value) is in a thread loop that updates every 10 seconds, but it first waits 10 seconds before starting the first fetch.
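The behavior described above comes down to the order of operations inside a polling loop. The following is a simplified model, not the SDK's actual code: `run_sync_loop` and `record_events` are hypothetical helpers that trace a sleep-first loop against a fetch-first loop without really sleeping.

```python
from typing import Callable, List


def run_sync_loop(
    fetch: Callable[[], None],
    sleep: Callable[[float], None],
    interval: float,
    iterations: int,
    fetch_first: bool,
) -> None:
    """Run a simplified polling loop for a fixed number of iterations."""
    for _ in range(iterations):
        if fetch_first:
            fetch()
            sleep(interval)
        else:
            sleep(interval)
            fetch()


def record_events(fetch_first: bool) -> List[str]:
    """Trace the order of operations without actually sleeping."""
    events: List[str] = []
    run_sync_loop(
        fetch=lambda: events.append("fetch"),
        sleep=lambda s: events.append(f"sleep {s:g}s"),
        interval=10,
        iterations=2,
        fetch_first=fetch_first,
    )
    return events
```

In the sleep-first variant the trace starts with `sleep 10s`, so a process that reads a config within the first interval only ever sees the cached value, which matches the X < 10 observation.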
**Jiakan Wang (Statsig):** yes I understand how this work. This is because you don’t have something else that’s updating the cache while your SDK is not running.
**Jiakan Wang (Statsig):** The data store / adapter model should have 1 data store that’s serving the value to N SDKs, and also all the SDKs will update the value in the data store periodically, so whenever you start a new SDK, the value it receives from the data store is fresh enough.
**Anonymous:** i see.
**Jiakan Wang (Statsig):** It may still have a few seconds of delay, depending on how many lambdas are being run at a given time for your specific service.
**Anonymous:** hmm, so in the case when an AWS lambda starts, it will be the only instance of the SDK, and if it loads from cache, it will not get refreshed until 10s later. "cache" as in a local cache file. just saying, a force refresh api would come in really handy lol.
**Jiakan Wang (Statsig):** ok, so you don’t have a way to have a shared data store somewhere?
**Jiakan Wang (Statsig):** That’s not very different than just initializing the SDK the normal way, isn’t it?
**Anonymous:** that will depend on another service, another layer of failure point. what i need is very simple, initialize the SDK, and if cannot reach statsig server/server has issue, fallback to a local cache/bootstrap data.
**Jiakan Wang (Statsig):** "that will depend on another service, another layer of failure point" It’s a very simple cache layer that just stores a JSON, so it’s less likely to have any issue. Even if it does have an issue, the SDK falls back to initializing normally, so worst case this is still not worse than not using it.
**Jiakan Wang (Statsig):** "what i need is very simple, initialize the SDK, and if cannot reach statsig server/server has issue, fallback to a local cache/bootstrap data." Do you know if the local cache for aws lambda is even going to be kept around for long enough for this to work?
**Anonymous:** the local cache is built at CI time.
**Anonymous:** but i see what you mean here.
**Jiakan Wang (Statsig):** If your lambda’s lifecycle is very short, you can set the `rulesets_sync_interval` option to something much shorter than the default of 10s, so the SDK will check the network much sooner after the initial load of the cached value.