GitLab’s 2017 Database Outage Postmortem Takeaways
The complete postmortem can be read here: Postmortem of database outage of January 31. In summary, GitLab.com suffered a database outage on 31 January 2017, which resulted in the loss of more than 6 hours of data.
Here are several takeaways that I learned from this postmortem.
- Set up several recovery mechanisms, so that another one still works when the first choice cannot be relied on. GitLab had 3 ways of backing up the database: a scheduled pg_dump to S3, disk snapshots using Azure, and LVM snapshots that were usually used to copy data from production to the staging environment.
  - The pg_dump cronjob failed because of a version mismatch between the pg_dump binary (for PostgreSQL 9.2) and the database (PostgreSQL 9.6).
  - The Azure disk snapshot was used to back up only the NFS server (I am guessing data for their application and others?). Moreover, this is a whole-disk recovery, so if GitLab needed to restore only the database data, I imagine they would have to manually pick out which part of the disk contains the database backup data.
  - The LVM snapshot is the recovery mechanism that GitLab chose for this incident. However, since they were transferring the backup snapshot from staging (on Azure classic storage, which has low network throughput) to production, it took them 18 hours to restore it, along with the webhooks that exist in the production database but not in the staging one.
- Recovery mechanisms (and perhaps the procedures too) should be tested regularly. GitLab only learned that their recovery mechanisms were not as robust as they thought after the disaster had already happened. The mechanisms and procedures should also be exercised in a safe environment, in anticipation of human mistakes from engineers (which will most likely happen, all the more so under stress).
- Error reporting should have a way to also raise an alert when a report fails to be delivered (whether the failure is in sending an email or in a messaging application). GitLab’s error reporting relied on email with DMARC, a mechanism to validate the trustworthiness of an email sender. However, the error reporting for the cronjob did not use DMARC, so its reports were rejected as coming from an untrusted sender.
- User deletion should be soft deletion. I am guessing that the hard deletion of the mistakenly flagged engineer's user account may have caused foreign key errors in PostgreSQL, which contributed to the lag they were facing, on top of the increased spam.
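
To make the pg_dump version mismatch from the first takeaway concrete, here is a minimal Python sketch of a check that could run before a backup job. It is my own illustration, not something from the postmortem. The function names are hypothetical, and I am assuming the two input strings come from `pg_dump --version` and from `SHOW server_version;` on the database. It relies on the fact that pg_dump refuses to dump from a server whose major version is newer than its own, and that before PostgreSQL 10 the major version is the first two numbers (9.2, 9.6).

```python
import re


def major_version(version_text):
    """Extract the PostgreSQL major version from text like 'pg_dump (PostgreSQL) 9.2.18'."""
    match = re.search(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    first = int(match.group(1))
    if first >= 10:
        # From PostgreSQL 10 onwards the major version is a single number.
        return (first,)
    # Before PostgreSQL 10 the major version is two numbers, e.g. 9.6.
    return (first, int(match.group(2) or 0))


def pg_dump_is_too_old(pg_dump_output, server_version):
    """Return True when the pg_dump binary is older than the server's major version."""
    return major_version(pg_dump_output) < major_version(server_version)


# Hypothetical inputs that mirror the versions mentioned in the postmortem.
if pg_dump_is_too_old("pg_dump (PostgreSQL) 9.2.18", "9.6.1"):
    print("pg_dump is older than the server, the backup would fail")
```

A check like this only catches one failure mode. It is meant as a cheap guard in front of the backup job, not a replacement for actually restoring a backup.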
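
The takeaway about testing recovery mechanisms regularly can start with something very small: a scheduled check that the latest backup file exists, is not empty, and is not stale. The sketch below is a hypothetical Python helper of my own (the name `backup_problems`, the thresholds, and the file path are all assumptions, not GitLab's tooling). It does not prove that a backup can be restored, so a real restore drill is still needed.

```python
import os
import time


def backup_problems(path, max_age_seconds=24 * 3600, min_size_bytes=1, now=None):
    """Return a list of problems found with a backup file; an empty list means it passed."""
    if not os.path.isfile(path):
        return [f"{path}: backup file is missing"]

    problems = []
    info = os.stat(path)

    # A backup job that fails early can leave behind an empty or tiny file.
    if info.st_size < min_size_bytes:
        problems.append(f"{path}: only {info.st_size} bytes, expected at least {min_size_bytes}")

    # A stale file suggests the scheduled job has stopped running.
    current_time = time.time() if now is None else now
    age_seconds = current_time - info.st_mtime
    if age_seconds > max_age_seconds:
        problems.append(f"{path}: last modified {age_seconds / 3600:.1f} hours ago")

    return problems


if __name__ == "__main__":
    # Hypothetical path; point this at wherever the backups actually land.
    for problem in backup_problems("/var/backups/db/latest.dump"):
        print("BACKUP CHECK FAILED:", problem)
```

Whatever sends these messages should itself be monitored, since the postmortem shows that failure notifications can also fail silently.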