Slack’s 2021 Outage Postmortem Takeaways
Slack had an outage on 4 January 2021 that lasted about an hour and affected their web application (they refer to it as a “tier”, but I am assuming that it relates to their client-facing application). This Slack blog post discusses how the problem was discovered (it also has a neat timeline diagram) and what caused it. These are my takeaways from Slack’s start-of-2021 outage.

Monitoring services should be made reliable. One major hindrance that Slack faced during the incident was that their monitoring services were unusable. As a result, they fell back to manual methods (such as querying their backends directly) to troubleshoot and recover the impacted systems.

Balance between upscaling and downscaling of instances. The article does not go into depth on this, but scaling seems to have played a major part. It only mentions disabling downscaling and adding 1,200 servers as part of the upscaling. Now, from what I know, this kind of scaling is usually handled by the cloud provider. My best gu...