Backups & Rollback Strategies

~25 min · Reliability

Primary Source

AWS RDS — Backups and Point-in-Time Recovery

Official guide to automated backups, manual snapshots, and restoring RDS databases.

Three Failure Scenarios

Bad deploy — new version breaks production
Data corruption — a bug or bad migration corrupts your database
Infrastructure failure — server crash, AZ outage, accidental deletion

Each requires a different recovery strategy. Plan for all three before the emergency.

Application Rollback

# Rollback = redeploy a previous Docker image tag (this is why images are tagged with the git SHA)

# Find the last good commit
git log --oneline -10

# Pull the previous image and retag it
# (assumes the api service in your compose file uses image: myapp:current)
docker pull ghcr.io/username/myapp:abc1234   # last good SHA
docker tag ghcr.io/username/myapp:abc1234 myapp:current
docker compose up -d api

# Via GitHub Actions — add a workflow_dispatch rollback workflow:
on:
  workflow_dispatch:
    inputs:
      sha:
        description: 'Git SHA to roll back to'
        required: true
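
A workflow like this typically hands the SHA to a small script on the server. The sketch below is one possible shape for that script, not a definitive implementation. It assumes the image name `ghcr.io/username/myapp` and the `api` service from the commands above. The function names are hypothetical. It rejects anything that does not look like a git SHA before touching Docker.

```shell
#!/usr/bin/env bash
# Hypothetical rollback helper. The image name and service name are
# assumptions taken from the manual commands above.

IMAGE_REPO="ghcr.io/username/myapp"

# Succeeds only for 7 to 40 lowercase hex characters (short or full git SHA).
is_git_sha() {
  case "$1" in
    ""|*[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -ge 7 ] && [ "${#1}" -le 40 ]
}

# Prints the full image reference for a SHA, or fails on bad input.
image_ref_for_sha() {
  is_git_sha "$1" || { echo "not a git SHA: $1" >&2; return 1; }
  echo "${IMAGE_REPO}:$1"
}

# Pulls the image for the given SHA, retags it, and restarts the service.
rollback_to() {
  local ref
  ref="$(image_ref_for_sha "$1")" || return 1
  docker pull "$ref" &&
    docker tag "$ref" myapp:current &&
    docker compose up -d api
}
```

After sourcing the file, `rollback_to abc1234` performs the same three steps as the manual commands. Validating the input first matters because a `workflow_dispatch` input is free text typed by whoever triggers the run.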

Database Backups

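# RDS: automated backups support point-in-time recovery (PITR) with a retention of 1-35 days
# (0 disables them; the default is 7 days for console-created instances, 1 day via CLI/API)
# Change: RDS → Modify → Backup retention period (7 days or more is a sensible floor)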

# Manual snapshot before risky migrations
aws rds create-db-snapshot \
  --db-instance-identifier myapp-db \
  --db-snapshot-identifier before-migration-20250601

# Point-in-time restore (creates a NEW instance; repoint the app at it afterwards)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier myapp-db \
  --target-db-instance-identifier myapp-db-restored \
  --restore-time "2025-06-01T14:30:00Z"

# Self-managed PostgreSQL: daily cron backup to S3
# (% must be escaped as \% in a crontab, otherwise cron treats it as a newline)
0 2 * * * pg_dump myapp | gzip | aws s3 cp - s3://backups/$(date +\%Y\%m\%d).sql.gz
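
The manual snapshot command above returns as soon as the snapshot is requested, not when it is finished. A minimal sketch of a pre-migration gate follows, assuming the `myapp-db` instance from the example. The function names are hypothetical. It waits until RDS reports the snapshot as available, so you know it succeeded before the migration starts.

```shell
#!/usr/bin/env bash
# Hypothetical pre-migration gate. The naming scheme mirrors the
# before-migration-YYYYMMDD identifier used above.

# Prints a snapshot identifier such as before-migration-20250601.
# Takes an optional YYYYMMDD date and defaults to today's date in UTC.
snapshot_id() {
  echo "before-migration-${1:-$(date -u +%Y%m%d)}"
}

# Requests the snapshot, then blocks until RDS reports it as available.
snapshot_and_wait() {
  local instance="$1" id
  id="$(snapshot_id)"
  aws rds create-db-snapshot \
    --db-instance-identifier "$instance" \
    --db-snapshot-identifier "$id" &&
    aws rds wait db-snapshot-available \
      --db-snapshot-identifier "$id"
}
```

A migration step could then be chained as `snapshot_and_wait myapp-db && ./run-migrations.sh`, where `run-migrations.sh` stands in for whatever migration command your project uses.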

🚨

Test your backups. An untested backup is not a backup. Restore to a test DB at least quarterly — many teams discover backup failures only when they desperately need them.
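
A restore drill for the `pg_dump` backups could look like the sketch below. It assumes a local PostgreSQL where you can create databases and a dump file already downloaded from S3. The scratch database name, the function names, and the `users` table in the sanity query are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical restore drill for gzip-compressed pg_dump backups.

# Succeeds only if the file is non-empty and passes gzip's integrity check.
dump_is_intact() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Restores a dump into a scratch database and runs one sanity query.
# "users" is a placeholder; pick a table you know should contain rows.
restore_drill() {
  local dump="$1" scratch_db="myapp_restore_test"
  dump_is_intact "$dump" || { echo "corrupt or empty dump: $dump" >&2; return 1; }
  dropdb --if-exists "$scratch_db" &&
    createdb "$scratch_db" &&
    gunzip -c "$dump" | psql --quiet -v ON_ERROR_STOP=1 "$scratch_db" &&
    psql -At -c "SELECT count(*) FROM users" "$scratch_db"
}
```

For example: `aws s3 cp s3://backups/20250601.sql.gz /tmp/drill.sql.gz && restore_drill /tmp/drill.sql.gz`. The integrity check alone is not enough, since a failed `pg_dump` can still produce a valid but empty archive. The query against real data is what shows the backup is usable.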

Blue-Green Deployments

Spin up a new server with the new version, validate it, then switch traffic. This gives near-zero downtime, and rollback is a matter of switching traffic back to the old server.

# Simplified blue-green with Route 53:
# Blue:  api.yourapp.com → old server IP (live)
# Green: deploy the new version to a new server IP and test it directly
# Switch: update the Route 53 A record to the new IP
# Rollback: change the A record back (takes effect within the record's TTL, so set a low TTL beforehand)
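
The switch and the rollback are the same Route 53 call with a different IP. A minimal sketch follows, assuming a plain A record with a 60-second TTL. The hosted zone ID, IPs, and function names are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical blue-green switch. Zone ID and IP addresses are placeholders.

# Prints a Route 53 change batch that points the A record at the given IP.
# Arguments: record name, IP address, optional TTL in seconds (default 60).
change_batch() {
  local name="$1" ip="$2" ttl="${3:-60}"
  cat <<EOF
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "${name}",
      "Type": "A",
      "TTL": ${ttl},
      "ResourceRecords": [{ "Value": "${ip}" }]
    }
  }]
}
EOF
}

# Applies the change. Call it again with the old IP to roll back.
switch_traffic() {
  local zone_id="$1" name="$2" ip="$3"
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$zone_id" \
    --change-batch "$(change_batch "$name" "$ip")"
}
```

For example, `switch_traffic ZEXAMPLE123 api.yourapp.com 203.0.113.20` moves traffic to the new server, and the same call with the old IP moves it back. Resolvers that cached the previous answer keep using it until the TTL expires, which is why the TTL should be lowered well before the switch.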

Check Your Understanding

1. A deploy with a database migration breaks production. You roll back the app code. What else might be needed?

2. You take a manual RDS snapshot before a migration. What does this enable?

3. In a blue-green deployment, traffic has been switched from blue to green (the new version) and an issue is found. How do you roll back?