Cool stories from production

Author: serge1peshcoff
Date/time: 17th August 2019 00:00 CEST - 03:00 CEST
Result: 2 databases corrupted; one was repaired by resetting the WAL, the other could not be repaired in place and was eventually restored from a backup
Summary:

Notes to myself:
- if you stop and then start the db container, it might suddenly crash and corrupt all of your data without recovery (happened with statutory and discounts)
- if you didn't make a db backup before attempting something, you're fucked (luckily I did)
- trying to restore a db from a SQL dump is quite fun
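
A minimal sketch of the "back up first" note, assuming the db container is still running, uses the official postgres image, and has a `postgres` superuser. The function names and the `backups` directory are made up for illustration and are not part of the actual setup:

```shell
# Sketch only: dump everything from a running postgres container before touching it.
# Assumptions: official postgres image, superuser "postgres".

# Build a timestamped dump path, e.g. backups/<container>-<timestamp>.sql
backup_path() {
  local container="$1"
  local dir="${2:-backups}"
  local stamp="${3:-$(date +%Y%m%d-%H%M%S)}"
  printf '%s/%s-%s.sql\n' "$dir" "$container" "$stamp"
}

# Dump all databases to the host; fail loudly instead of leaving an empty file behind.
backup_db() {
  local container="$1"
  local out
  out="$(backup_path "$container" "${2:-backups}")"
  mkdir -p "$(dirname "$out")"
  docker exec "$container" pg_dumpall -U postgres > "$out" || { rm -f "$out"; return 1; }
  [ -s "$out" ] || { echo "empty dump: $out" >&2; return 1; }
  echo "$out"
}

# Usage (not run here): backup_db postgres-oms-statutory
```

Checking that the dump is non-empty is the cheap part; actually test-restoring it somewhere is the only real proof that the backup works.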

What actually happened: apparently if you stop and start the db container, you can get this in postgres logs as a result:

```PANIC: could not locate a valid checkpoint record```

after which the db refuses to start and exits with a non-zero exit code. The top answer for dealing with it suggests resetting the WAL:

```pg_resetwal -f DATADIR```

(which is another challenge to find in a docker container). Once I'd done this on statutory, I realized that half of my data was missing: the db lacked some 7 or 8 migrations applied over the last 3 months.
So the only reasonable option I came up with was to restore the db from the SQL dump I'd made earlier (after an hour of searching for how to transfer it from the host machine into a container that isn't running), and it (apparently) worked. Yay.
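
For next time, a rough sketch of the two fiddly bits: looking at the WAL state of a container that won't start, and getting a dump file into a stopped container. This is not a record of the exact commands used that night. The image tag, the data directory (the official image's default `PGDATA`) and the `/tmp/dump.sql` target are assumptions, and the helper names are hypothetical:

```shell
# Sketch only. Set DRY_RUN=1 to print the docker commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Start a throwaway container that mounts the volumes of the dead one and run
# pg_resetwal with -n, which only reports what it would do and changes nothing.
# pg_resetwal refuses to run as root, hence -u postgres.
wal_dry_run() {
  local container="$1"
  local image="$2"                               # assumed: same image as the db container
  local datadir="${3:-/var/lib/postgresql/data}" # assumed: default PGDATA of the official image
  run docker run --rm -u postgres --volumes-from "$container" "$image" pg_resetwal -n "$datadir"
}

# Copy a dump from the host into a container; docker cp also works on stopped containers.
copy_dump() {
  local dump="$1"
  local container="$2"
  run docker cp "$dump" "$container:/tmp/dump.sql"
}

# Usage (not run here):
#   wal_dry_run postgres-oms-statutory postgres:10
#   copy_dump statutory.sql postgres-oms-statutory
```

Running with `-n` first shows the values `pg_resetwal` would write, which is a chance to back out before `-f` actually throws data away.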

So the statutory db was restored. I have no idea what was up with the discounts database, but it's up and running (resetting the WAL helped and didn't destroy everything).

Then I spent another hour trying to set secrets for oms-mailer and oms-mail-transfer-agent, but eventually managed it. So now everything should be okay.

Here's supposedly how it happened:

- I've run `make stop`, which stopped all the containers
- postgres-oms-statutory and postgres-oms-discounts both tried to write to the WAL simultaneously on shutdown (probably flushing to disk data that was still only in memory), corrupting it
- other DBs weren't affected because they didn't use the postgres volume
- I deployed everything and ran `make start`
- two DBs started to complain about WAL being corrupted
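
If the theory above is right, the two broken DBs were sharing a volume. A quick way to check that, sketched under the assumption that the containers use named docker volumes; the function names are hypothetical:

```shell
# Sketch only: find named volumes that are mounted by more than one container.

# Print "container volume" pairs for the given containers (needs docker; not run here).
# Bind mounts have an empty .Name and are skipped.
list_mounts() {
  for c in "$@"; do
    docker inspect -f '{{range .Mounts}}{{.Name}}{{"\n"}}{{end}}' "$c" |
      while read -r v; do
        if [ -n "$v" ]; then echo "$c $v"; fi
      done
  done
}

# Read "container volume" lines on stdin and print every volume used by
# more than one container, followed by the containers that use it.
shared_volumes() {
  awk '{ users[$2] = users[$2] " " $1; n[$2]++ }
       END { for (v in n) if (n[v] > 1) print v ":" users[v] }' | sort
}

# Usage (not run here):
#   list_mounts postgres-oms-statutory postgres-oms-discounts | shared_volumes
```

Any line printed by `shared_volumes` means two postgres instances can write to the same data directory, which would explain the corrupted WAL.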