Slack’s 2021 Outage Postmortem Takeaways
Slack had an outage on 4 January 2021 that lasted about an hour and affected their web application (they call it a “tier”, which I assume refers to their client-facing application). Slack’s blog post describes how the problem was discovered (with a neat timeline diagram) and what caused it. These are my takeaways from Slack’s start-of-2021 outage.
- Monitoring services should be reliable. One major hindrance Slack faced during the incident was that their monitoring services were unusable. As a result, they fell back to manual methods (such as querying their backends directly) to troubleshoot and recover the impacted systems.
- Balance the upscaling and downscaling of instances. The article does not dive deep into this, but scaling seems to have played a major part. It only mentions disabling downscaling and adding 1,200 servers for the upscaling. From what I know, this kind of scaling is usually handled by the cloud provider. My best guess is that they added servers (or should it be instances?) by increasing the minimum number of (healthy) instances. I am not sure how they disabled downscaling, but I guess a minimum instance count also effectively “disables” downscaling below that count.
- If a service expects high usage after a long quiet period, measures should be in place to handle the short-lived surge. In Slack’s case, the measure is to “request a preemptive upscaling of our [Slack] TGWs at the end of the next holiday season”, since one cause of the incident was the slow upscaling of their TGWs (AWS Transit Gateways).
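To make my guess about the minimum instance count concrete, here is a minimal sketch. It is not how Slack or AWS actually implements scaling; the function name `target_instances` and its parameters are made up for illustration. It only shows why raising the minimum acts as a floor that downscaling cannot go below.

```python
def target_instances(demand: int, min_instances: int, max_instances: int) -> int:
    """Clamp the instance count that the load calls for to the configured bounds.

    Hypothetical helper: `demand` is the number of instances the current load
    would need, while `min_instances` and `max_instances` are the configured limits.
    """
    return max(min_instances, min(demand, max_instances))


# Low demand cannot pull the fleet below the minimum, so raising the minimum
# both adds instances and blocks downscaling past that point.
quiet_period = target_instances(demand=40, min_instances=100, max_instances=500)
busy_period = target_instances(demand=300, min_instances=100, max_instances=500)
```

Under these assumptions, `quiet_period` stays at the floor of 100 even though the load only calls for 40 instances, while `busy_period` follows the demand.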
These are additional resources mentioned in the article that I found interesting. They may supplement the lessons learned beyond the outage’s cause.
- A guide on how to handle overload from Google’s SRE. It covers practices for managing overload through artificial throttling (on either the backend or the client side) when a certain threshold is met. The thresholds discussed are CPU usage and customer quota (the “customer” here is a client application or service that uses a backend). I love this quote from the page because it embodies the general idea of how to handle overload gracefully: “At the end of the day, it's best to build clients and backends to handle resource restrictions gracefully: redirect when possible, serve degraded results when necessary, and handle resource errors transparently when all else fails.”
- Martin Fowler’s article on the circuit breaker pattern. From what I understand, it is like a more advanced retry mechanism: it wraps functions or processes that may fail for reasons outside the program’s own logic, and after repeated failures it stops calling them for a while instead of retrying right away. It is worth exploring for processes that rely on remote resources (which is the case for this Slack outage).
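To get a feel for the circuit breaker idea, here is a minimal sketch in Python. It is my own simplified take, not Martin Fowler’s code and not anything Slack uses; the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions for illustration. The breaker starts closed, opens after a number of failures, rejects calls while open, and lets one trial call through (half-open) once the timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failure_count = 0
        self.opened_at = None                       # None means closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # timeout passed, allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        self.failure_count += 1
        # A failed trial call, or too many failures, (re)opens the circuit.
        if self.opened_at is not None or self.failure_count >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would wrap a remote call as `breaker.call(fetch_something, arg)` and treat `CircuitOpenError` as a signal to serve a fallback or degraded result instead of hitting the struggling dependency again.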