Why do we need a Backup/ Recovery Strategy ?
Backup and Recovery Strategies are absolutely essential for uninterrupted operation of any live business unit. The strategy must plan for recovery in every catastrophe such as:
- Device Failure: Loss of Machine or Disk
- Failure during Maintenance: Hardware or software upgrades
- Site Failure: Failure at Datacenter or a network failure
- Blunders by Operators: Devops/ System operator drops a table/ schema/ datafile. Github is a recent example – https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/
- Data Corruption: Application introduces poor code and in turn corrupts the data or Disk gets corrupted
- Compliance: Data Retention Periods, Storage of Readable and Writeable data.
Profitability guides the strategy – Business Impact Vs Cost. Decision revolves around questions like:
- How long will recovery take?
- How much will data storage cost?
- How long do we need to store backup data?
- Will an outage affect the Brand?
- Can the database or any part remain operational during backup and/or recovery?
- Will the root cause of failure be tracked with the strategy?
There are three fundamentally different approaches to backing up Postgres data-
- SQL dump
You can create a text file with SQL commands using Postgres utility program pg_dump. pg_dumpall is used to dump an entire database cluster in plain-text SQL format alongwith user, groups, and associated permissions
CONS: Dumps created by pg_dump are internally consistent, that is, the dump represents a snapshot of the database as of the time pg_dump begins running. The utility does not block readers or writers.
- File system level backup (Cold Backup)
An alternative backup strategy is to directly copy the files that Postgres uses to store the data in the database.
CONS: The database server must be shut down in order to get a usable backup. File system backups only work for complete backup and restoration of an entire database cluster
- Continuous Archiving (Hot Backup)
At all times, PostgreSQL maintains a Write Ahead Log (WAL) in the pg_xlog/ subdirectory of the cluster’s data directory. The log records every change made to the database’s data files. This log exists primarily for crash-safety purposes: if the system crashes, the database can be restored to consistency by “replaying” the log entries made since the last checkpoint. However, the existence of the log makes it possible to use a third strategy for backing up databases: we can combine a file-system-level backup with backup of the WAL files. If recovery is needed, we restore the file system backup and then replay from the backed-up WAL files to bring the system to a current state.
- This allows the database to stay operational during backup and enables online file system backup of a database cluster
- We do not need a perfectly consistent file system backup as the starting point. Any internal inconsistency in the backup will be corrected by log replay (this is not significantly different from what happens during crash recovery). So we do not need a file system snapshot capability, just tar or a similar archiving tool.
- Since we can combine an indefinitely long sequence of WAL files for replay, continuous backup can be achieved simply by continuing to archive the WAL files. This is particularly valuable for large databases, where it might not be convenient to take a full backup frequently.
- It is not necessary to replay the WAL entries all the way to the end. We could stop the replay at any point and have a consistent snapshot of the database as it was at that time. Thus, this technique supports point-in-time recovery: it is possible to restore the database to its state at any time since your base backup was taken.
- If we continuously feed the series of WAL files to another machine that has been loaded with the same base backup file, we have a warm standby system: at any point we can bring up the second machine and it will have a nearly-current copy of the database.
- As with the plain file-system-backup technique, this method can only support restoration of an entire database cluster, not a subset.
- Also, it requires a lot of archival storage: the base backup might be bulky, and a busy system will generate many megabytes of WAL traffic that have to be archived. Still, it is the preferred backup technique in many situations where high reliability is needed.
Requirements for Postgres PITR
- A full/ Base backup (This section is WIP)
- The following must be set in postgresql.conf:
- wal_level to archive or hot_standby (For details refer postgres official page)
- archive_mode to on
- archive_command that performs archiving only when a switch file exists and supports PITR. Eg.:
archive_command = ‘test ! -f /var/lib/pgsql/backup_in_progress || (test ! -f /var/lib/pgsql/archive/%f && cp %p /var/lib/pgsql/archive/%f)’
Tools available for Postgres PITR
(This section is WIP)
Other Available Recovery Solutions
- Replication with Failover and Switchback
- Delayed Replication