Posts

Showing posts from 2022

Slack’s 2021 Outage Postmortem Takeaways

Slack had an outage on 4 January 2021 that lasted about an hour and affected their web application (they refer to it as a “tier”, but I am assuming it means their client-facing application). Slack's blog post discusses how the problem was discovered (it also has a neat timeline diagram) and what caused it. These are my takeaways from Slack's start-of-2021 outage. Monitoring services should be made reliable. One major hindrance that Slack faced during the incident was that their monitoring services were unusable. As a result, they fell back to manual methods (such as querying their backends directly) to troubleshoot and recover the impacted systems. Balance the upscaling and downscaling of instances. The article does not dive deep into this, but scaling seems to have played a major part. It only mentions disabling downscaling and adding 1,200 servers when upscaling. Now, from what I know, this kind of scaling is usually handled by the cloud provider. My best gu...

GitLab’s 2017 Database Outage Postmortem Takeaways

The complete postmortem can be read here: Postmortem of database outage of January 31. In summary, GitLab.com had a database outage on 31 January 2017. The outage resulted in the loss of more than six hours of data. There are several takeaways that I learned from this postmortem. Set up several recovery mechanisms that work in case the first choice cannot be relied on. GitLab had three ways of backing up its database: a scheduled pg_dump to S3, disk snapshots on Azure, and LVM snapshots that are usually used to copy data from production to the staging environment. The pg_dump cron job failed because the pg_dump version (for PostgreSQL 9.2) did not match the database version (PostgreSQL 9.6). The Azure disk snapshots were only used to back up the NFS server (I am guessing data for their application and others..?). Moreover, these are disk-level snapshots, which means that if GitLab only needed to restore the database data, I imagine they would have to manually choose which part of...
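
The pg_dump version mismatch mentioned above is the kind of failure a small pre-flight check could catch before a backup job runs. Below is a minimal sketch of such a check, not GitLab's actual fix. The function names `major_version` and `pg_dump_can_back_up` are hypothetical, and it assumes the two version strings have already been collected (for example from `pg_dump --version` and `SHOW server_version`). It relies on the documented rule that pg_dump cannot dump from a server newer than its own major version.

```python
import re
from typing import Tuple


def major_version(version_text: str) -> Tuple[int, ...]:
    """Extract the PostgreSQL major version from a version string.

    Before PostgreSQL 10 the major version has two parts (9.2, 9.6);
    from 10 onwards it is the first number only.
    """
    match = re.search(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    first = int(match.group(1))
    second = int(match.group(2) or 0)
    return (first,) if first >= 10 else (first, second)


def pg_dump_can_back_up(pg_dump_version_text: str, server_version_text: str) -> bool:
    """Return True when pg_dump is at least as new as the server (by major version)."""
    return major_version(pg_dump_version_text) >= major_version(server_version_text)


if __name__ == "__main__":
    # Illustrative inputs matching the versions named in the postmortem;
    # the patch numbers are made up for the example.
    ok = pg_dump_can_back_up("pg_dump (PostgreSQL) 9.2.18", "9.6.1")
    print("backup can proceed" if ok else "pg_dump is older than the server, alert someone")
```

A check like this only helps if its failure is noticed, so the result would need to feed into whatever alerting the backup job already uses.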