Cool stories from production
Author: serge1peshcoff
Date/time: 17th August 2019 00:00 CEST - 03:00 CEST
Result: 2 databases corrupted; one lost data after a WAL reset and was eventually restored from a backup, the other survived the WAL reset
Summary:
Notes to myself:
- if you stop and then start the db container, it might suddenly crash and corrupt all of your data without recovery (happened with statutory and discounts)
- if you didn't make a db backup before attempting something, you're fucked (luckily I did)
- trying to restore a db from a SQL dump is quite fun
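For the "backup first" note, a minimal sketch of taking a plain SQL dump from a running db container before touching anything. Only the container naming comes from these notes; the database name, the `postgres` user, and the output path are assumptions, so adjust them to the real setup.

```python
import subprocess
from pathlib import Path


def pg_dump_command(container: str, database: str, user: str = "postgres") -> list[str]:
    """Build the argv that dumps one database from a running container."""
    # `database` and `user` are placeholders, not the real credentials.
    return ["docker", "exec", container, "pg_dump", "-U", user, database]


def backup_database(container: str, database: str, out_file: Path) -> None:
    """Write the dump to out_file on the host; raises if pg_dump fails."""
    with out_file.open("wb") as fh:
        subprocess.run(pg_dump_command(container, database), stdout=fh, check=True)
```

Something like `backup_database("postgres-oms-statutory", "statutory", Path("statutory.sql"))` before running anything risky would leave the dump on the host, outside the container.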
What actually happened: apparently if you stop and start the db container, you can get this in postgres logs as a result:
```PANIC: could not locate a valid checkpoint record```
after which the db refuses to work and exits with a non-zero exit code. The top answer for dealing with it suggests resetting the WAL:
```pg_resetwal -f DATADIR```
(finding the data directory inside a docker container is another challenge), and once I'd done this on statutory, I realized that half of my data was missing: the table was lacking something like 7 or 8 migrations applied over the last 3 months.
So the only reasonable thing I came up with was to restore the db from the SQL dump I'd made earlier (after spending an hour searching for how to transfer it from the host machine into a container that isn't running), and it (apparently) worked. Yay.
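For the record, `docker cp` also works on stopped containers, which covers the "container is not running" part. A rough sketch of the two steps (copy the dump in, then load it once postgres accepts connections again); the dump path, database name, and user are hypothetical, and these are not necessarily the exact commands used that night.

```python
def copy_dump_command(dump_path: str, container: str, dest: str = "/tmp/dump.sql") -> list[str]:
    """Build the argv that copies a dump from the host into a container (running or stopped)."""
    return ["docker", "cp", dump_path, f"{container}:{dest}"]


def restore_command(container: str, database: str, user: str = "postgres",
                    dest: str = "/tmp/dump.sql") -> list[str]:
    """Build the argv that loads the copied dump; needs the container running again."""
    return [
        "docker", "exec", container,
        "psql", "-U", user, "-d", database,
        "-v", "ON_ERROR_STOP=1",  # stop at the first failing statement
        "-f", dest,
    ]
```

Both helpers only build the command; passing the result to `subprocess.run(..., check=True)` would actually run it.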
So the statutory db was repaired, and I have no idea what was up with the discounts database, but it's up and running (resetting the WAL helped and didn't destroy everything).
Then I spent another hour trying to set secrets for oms-mailer and oms-mail-transfer-agent, but eventually I managed it. So now everything should be okay.
Here's how it supposedly happened:
- I ran `make stop`, which stopped all the containers
- postgres-oms-statutory and postgres-oms-discounts both tried to write something to the WAL simultaneously on shutdown (probably flushing to disk the data that was still only in memory), corrupting it
- other DBs weren't affected because they didn't use the postgres volume
- I deployed everything and ran `make start`
- the two DBs started to complain about the WAL being corrupted
Author: serge1peshcoff
Date/time: 23rd August 2019
Result: wrong event deleted due to a bug in the events module
Summary:
Okay, so this was quite fun.
In the events module, an event has 'id' and 'url' fields. You can find an event either by ID (if it's numeric) or by URL. So when you query an endpoint like /events/:id, if the ID is numeric, it searches for the event by ID or URL (since the url can also be numeric); if not, it searches by url only.
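A minimal sketch of that lookup rule, not the module's actual code. The sample rows are made up, and which match wins when both an ID and a URL fit is an assumption here (ID first).

```python
import re
from typing import Optional

# Hypothetical sample rows, not real events.
EVENTS = [
    {"id": 1, "url": "spring-agora", "name": "Example event A"},
    {"id": 2, "url": "7", "name": "Example event B"},
]


def find_event(key: str, events: list[dict]) -> Optional[dict]:
    """Resolve the :id part of /events/:id to an event, or None."""
    if re.fullmatch(r"[0-9]+", key):
        # Numeric key: try the ID first (assumed priority).
        for event in events:
            if event["id"] == int(key):
                return event
    # Non-numeric key, or no ID matched: fall back to the url field,
    # which may itself be numeric.
    for event in events:
        if event["url"] == key:
            return event
    return None
```

With these rows, `find_event("7", EVENTS)` finds event B through its url, because no event has ID 7.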
We had the following two events:
 id | url | name
----+-----+-----------------------------------------
 44 |     | NWM Chisinau: Welcome Back to Casa Mare
 22 | 44  | RTC London - "London, mind the RTC!"
Someone smart set the event URL of RTC London to 44. That resulted in a funny bug:
1) when you open the events listing and click on RTC London (the link to this event is /events/44), you end up on the NWM Chisinau page.
2) one girl managed to delete the wrong event (RTC London instead of NWM Chisinau).
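One possible guard against this, sketched under the assumption that purely numeric URLs can simply be rejected (this is not necessarily the fix that went into the events module). The helper names are hypothetical; the two rows are the ones from the table above, with the empty url stored as `None`.

```python
import re

# The two rows from the table above.
EVENTS = [
    {"id": 44, "url": None, "name": "NWM Chisinau: Welcome Back to Casa Mare"},
    {"id": 22, "url": "44", "name": 'RTC London - "London, mind the RTC!"'},
]


def _is_numeric(value: str) -> bool:
    return re.fullmatch(r"[0-9]+", value) is not None


def url_is_unambiguous(url: str) -> bool:
    """Accept a url only if it can never be mistaken for an event ID."""
    return not _is_numeric(url)


def find_collisions(events: list[dict]) -> list[tuple[int, int]]:
    """Return (event_id, clashing_id) pairs where an event's url equals another event's ID."""
    ids = {event["id"] for event in events}
    collisions = []
    for event in events:
        url = event["url"]
        if url and _is_numeric(url) and int(url) in ids and int(url) != event["id"]:
            collisions.append((event["id"], int(url)))
    return collisions
```

Running `find_collisions` over the existing events would list the pairs to clean up by hand, and `url_is_unambiguous` could be called when an event URL is set.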