Post Mortem New Year's Eve!
February 09, 2023
What Happened?
On 31 December 2022, near midnight, our production environment suffered an outage. It was caused by corruption in our PostgreSQL database. Why the corruption occurred in the first place was unclear.
All attempts to start the Docker Compose instance failed. I had a backup of the production environment, or so I thought. Upon attempting restoration, I found that it contained empty databases. The backups appeared successful but lacked any data.
The issue was traced to multiple Postgres instances started by Docker Compose, which competed with (or for) the actual database. The backup script ran without triggering an error, but it backed up an empty instance and therefore produced empty results.
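One way to catch a backup that "succeeds" but contains no data is to inspect the dump itself. The following is a minimal sketch, not the backup script used here. It assumes the backup is a plain-text SQL dump in pg_dump's default format, where table data appears in `COPY ... FROM stdin;` blocks that end with a `\.` line. The function names and the `min_rows` threshold are hypothetical.

```python
import re

# Matches the header line that pg_dump writes before a table's data rows,
# e.g. "COPY public.users (id, name) FROM stdin;"
COPY_HEADER = re.compile(r"^COPY\s+(\S+)\s.*FROM stdin;$")
# Line that terminates a COPY data block in a plain-text dump.
END_OF_DATA = "\\."


def count_dumped_rows(dump_text: str) -> dict:
    """Return a mapping of table name to the number of data rows in the dump."""
    counts = {}
    current_table = None
    for line in dump_text.splitlines():
        if current_table is None:
            match = COPY_HEADER.match(line)
            if match:
                current_table = match.group(1)
                counts[current_table] = 0
        elif line == END_OF_DATA:
            current_table = None
        else:
            counts[current_table] += 1
    return counts


def assert_backup_has_data(dump_text: str, min_rows: int = 1) -> int:
    """Raise if the dump holds fewer than min_rows rows across all tables."""
    total = sum(count_dumped_rows(dump_text).values())
    if total < min_rows:
        raise RuntimeError(
            f"backup looks empty: {total} data rows found, expected at least {min_rows}"
        )
    return total
```

A check like this could run right after the backup job and fail loudly, so that an empty dump is noticed on the day it is created rather than on the day it is needed. For very large dumps, the same logic would be better applied line by line to an open file instead of a string held in memory.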
How We Partially Recovered
I recovered using a 1.5-month-old backup that still had some of the relevant data.
I recovered additional information using our logs and got the database running again. It wasn't optimal, as the data was corrupted. Scripts and manual data insertion were used to bring the database back to its complete state.
Process Changes
I’ve taken some precautions and put some procedures in place to mitigate the risk of such an incident occurring again.
The production environment and code were in a pretty cluttered state. This is due to over six years of development and infrequent refactoring.
In January, efforts were made to simplify and streamline the code and production environment.
The system now runs on bare metal instead of Docker Compose, and the number of dependencies has been cut to a minimum.
Others have taken such steps before us, and we surely won’t be the last on the quest to reduce complexity:
"They're rebuilding the Death Star of complexity"
This should hopefully prevent the incident from recurring.
A Python Fabric script has been created and can be run with one command to reset the staging environment with the latest backup. In a nutshell, this is what it does:
Retrieves the backup from a backup server (located on a different continent)
Removes sensitive information
Updates developer and tester accounts to admin status
Overwrites the staging environment database
Prints out the latest events
Positive outcomes of this measure:
Ensures that the staging environment remains up-to-date
Verifies that the backups are working correctly. If an error is triggered, we know our database backup might not be complete.
Checking the latest events when an error is detected can reveal the problem and show whether any data was restored from an old backup.
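The reset steps listed above could be chained so that any failure stops the run and is reported, which is what lets the staging reset double as a backup check. This is only a sketch of that control flow in plain Python, not the actual Fabric script. The step bodies are placeholders, and names such as `run_reset` and `ResetFailed` are hypothetical. In a Fabric task, each step would typically wrap one or more `Connection.run` calls.

```python
from typing import Callable, Iterable, List, Tuple

# A step is a human-readable name plus a callable that raises on failure.
Step = Tuple[str, Callable[[], None]]


class ResetFailed(RuntimeError):
    """Raised when one step of the staging reset fails."""


def run_reset(steps: Iterable[Step]) -> List[str]:
    """Run the steps in order and stop at the first failure.

    Returns the names of the completed steps when everything succeeds.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Later steps are skipped, so staging is never left half-restored
            # without a visible error.
            raise ResetFailed(
                f"step '{name}' failed after completing {completed}: {exc}"
            ) from exc
        completed.append(name)
    return completed


def placeholder() -> None:
    """Hypothetical stand-in; a real step would run shell or SQL commands."""


STEPS: List[Step] = [
    ("retrieve latest backup from the backup server", placeholder),
    ("remove sensitive information", placeholder),
    ("update developer and tester accounts to admin", placeholder),
    ("overwrite the staging database", placeholder),
    ("print the latest events", placeholder),
]

if __name__ == "__main__":
    for finished in run_reset(STEPS):
        print("done:", finished)
```

Stopping at the first error keeps the signal simple: a clean run means the newest backup could be fetched and restored, and a failed run names the step that needs attention.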
Refactoring has also been prioritized. Over the past five to six years it was not given the necessary importance, but going forward, at least one week of refactoring every six months is planned and scheduled in the team lead's calendar.
Moving Forward
I plan to improve our image backup process. We have hundreds of gigabytes of images backed up, but they aren’t subjected to a routine check.
I will create a script that compares the images in the production and backup systems to ensure they're up to date and there are no discrepancies. Because it involves a large amount of data, this process is expected to take a significant amount of time.
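Such a comparison might look roughly like the sketch below. It is not the planned script, just one possible shape for it. It assumes both image trees are reachable as local directories (for example, via a mount), and it compares files by SHA-256 digest. The function names and report keys are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large images never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root: Path) -> dict:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(production: Path, backup: Path) -> dict:
    """Report files that are missing, unexpected, or different in the backup."""
    prod_index = index_tree(production)
    backup_index = index_tree(backup)
    shared = prod_index.keys() & backup_index.keys()
    return {
        "missing_in_backup": sorted(prod_index.keys() - backup_index.keys()),
        "extra_in_backup": sorted(backup_index.keys() - prod_index.keys()),
        "content_differs": sorted(
            name for name in shared if prod_index[name] != backup_index[name]
        ),
    }
```

Hashing every file is thorough but slow for hundreds of gigabytes. A cheaper first pass could compare only file names and sizes, with full digests reserved for a periodic deep check.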
If you found this interesting, you will likely enjoy How We Do Software.