Skip to content

Restore drill runbook

Status. Round 6 PR2 of the Wave 0 plan — closes the §8 docs contract that the executable companion scripts/restore-drill.sh automates. PR2a (this PR) ships the script + this runbook. The first executed drill against staging — with observed RTO/RPO numbers — is PR2b; the maintainer runs the drill and updates the §RTO/RPO observation section here with the artefact link.

This page tells the maintainer (and any future on-call) how to exercise Panorama's restore path before they need it for real. The drill answers two questions:

  1. Can we actually restore? Backups that aren't periodically exercised are aspirational. The drill walks dump → restore → verify end-to-end and produces a structured report.
  2. How long does it take? RTO (recovery time objective) and RPO (recovery point objective) are honest only when measured. The drill captures RTO components; the runbook explains how to translate them into the published commitment.

The drill is the executable counterpart of the policy documents:

  • incident.md — Phase 5 §"Recover" invokes this runbook when restore is the chosen recovery path.
  • secrets-rotation.md — the cross-cutting §"Integration with the restore drill" section now links here.
  • verify-audit-chain.md — the drill's chain-verify step exercises that runbook's CLI against the restored DB; a drill that passes is also a chain-integrity spot check.
  • secrets-inventory.md — the drill consumes DATABASE_PRIVILEGED_URL for the source DB; the inventory documents that secret class.

When to drill

TriggerCadence
Wave 0 acceptance §8 closingOnce, against staging, by the maintainer (PR2b)
Scheduled hygieneQuarterly drill against staging; capture artefacts under docs/audits/restore-drill-<ISODate>/
Post-incidentWithin 7 days of any P0 or P1 closure where backups were referenced in the contain or recover phase
Pre-major-upgradeBefore a Postgres major-version upgrade (e.g., 16 → 17) or a Supabase project migration
After any source-schema change that touches the audit chainThe drill is the canary for migration 0021 + the SECURITY DEFINER triggers; a chain that no longer verifies post-restore is a release blocker

The drill is not a substitute for the chain-verify CLI running weekly in CI (per verify-audit-chain.md §"How to run"). The CLI is the daily-driver chain check; the drill also exercises pg_dump / pg_restore / RLS-policy reconstitution.

Pre-flight

Before invoking the drill, gather:

  1. Source DB URL — the database to drill from. Two acceptable sources:

    • Stagingapps/core-api/.env.staging is generated by scripts/setup-staging-env.sh and contains the URL. Never source from production — the drill script will refuse a URL containing prod heuristics (see §Refuse-prod safety guard).
    • Local dev-stackpostgres://panorama:panorama@localhost:5432/panorama?sslmode=disable. Useful for smoke-testing the script itself; numbers from a local dev DB are NOT the real RTO. Staging is the source for the published number.
  2. Target DB URL — a SEPARATE database the drill restores into. The drill refuses to restore into the source itself. Two shapes:

    • Local sibling DB — fastest path. Create on the dev-stack:
      bash
      docker exec docker_postgres_1 \
          psql -U panorama -d postgres \
          -c "CREATE DATABASE panorama_drill_target"
      URL: postgres://panorama:panorama@localhost:5432/panorama_drill_target?sslmode=disable. Drop after the drill to keep the dev-stack tidy:
      bash
      docker exec docker_postgres_1 \
          psql -U panorama -d postgres \
          -c "DROP DATABASE panorama_drill_target"
    • Throwaway Supabase project — true end-to-end shape for a real RTO claim. Spin up a free-tier project at supabase.com/dashboard; populate apps/core-api/.env.drill via a one-off run of setup-staging-env.sh; pass the direct URL via --dst-url. Cost: the free-tier project ships with one week of idle-pause; remember to delete it after the drill.
  3. Toolingpg_dump, pg_restore, psql, jq, pnpm. The drill script refuses to run without any of these. On Ubuntu / Debian:

    bash
    apt install postgresql-client-16 jq
  4. DATABASE_PRIVILEGED_URL for the TARGET DB — chain-verify reads it from env via the wrapping drill script; you do not set it manually.

  5. Out-dir — where to write the drill artefacts. Default is docs/audits/restore-drill-<UTC-ISO-timestamp>/. Override via --out-dir if you want the report under a custom path (e.g., per-incident: docs/audits/restore-drill-post-incident-#241/).

Running the drill

The canonical invocation:

bash
scripts/restore-drill.sh \
    --src-url "$SRC_URL" \
    --dst-url "$DST_URL"

Append --force-truncate-dst if the target has existing tables you want the drill to drop. (DROP SCHEMA public CASCADE; CREATE SCHEMA public; — destructive on the TARGET only, never the source.) The drill requires the target DB name to contain drill or test as an extra safety check; override with PANORAMA_DRILL_ALLOW_ANY_TARGET=1 if you've audited the target manually. Append --skip-chain-verify only if you're smoke-testing the dump/restore mechanics and don't have a populated audit chain yet.

Example — local sibling DB, drilling from staging:

bash
set -a; source apps/core-api/.env.staging; set +a
scripts/restore-drill.sh \
    --src-url "$DATABASE_PRIVILEGED_URL" \
    --dst-url "postgres://panorama:panorama@localhost:5432/panorama_drill_target?sslmode=disable"

What it does in order:

  1. Probe both URLs (verifies reachability + Postgres version match loosely).
  2. Refuse-prod safety check on both URLs (heuristic — catches the sleep-deprived miss-paste, NOT an adversarial input).
  3. Capture source-side baseline counts (migrations, audit_events, tenants, users, memberships, RLS policies, audit_events tail hash). Written to source-baselines.json.
  4. pg_dump source in custom format → source.dump. Records wall-clock duration as dump RTO component.
  5. pg_restore into target. Records wall-clock duration as restore RTO component.
  6. Capture target-side baseline counts. Written to target-baselines.json.
  7. Diff source vs target baselines. Written to drift.json — every key is { src, dst, match }. The drill expects match: true for every key.
  8. Run pnpm chain-verify against the restored target DB. Records wall-clock duration as verify RTO component. Three chain states:
    • OK — every non-legacy row verifies; the chain survived dump/restore byte-exact.
    • INCONCLUSIVE — source had an empty audit_events table. Realistic only when drilling against a freshly-reset local dev DB. Re-run against staging for a real result.
    • FAIL — at least one row's selfHash does not match its recomputed digest. This is a tamper signal, not a restore-mechanic failure: the dump captured a chain that was already broken at the source, or the restore corrupted the digestPreImage column. Investigate per verify-audit-chain.md.
  9. Emit report.json with the RTO components + verification outcome. Print a human-readable summary to stdout.

The drill exit code:

ExitMeaning
0Drill passed: counts match AND chain-verify is OK (or skipped)
1Drill failed: count mismatch OR chain-verify reports tamper
2Operational error: missing tooling, refuse-prod triggered, pg_dump or pg_restore failed
3Drill INCONCLUSIVE: counts match but chain-verify reported empty source — drill mechanic worked, but a freshly-reset dev DB has nothing to verify. Not the same as PASS. Re-run against a populated source for a real result.

What "successful" looks like

A passing drill against a populated source emits a report.json shaped like:

json
{
  "drill_executed_at": "2026-05-17T19:30:00Z",
  "rto_seconds_total": 47,
  "components": {
    "dump_seconds": 12,
    "restore_seconds": 22,
    "verify_seconds": 13,
    "dump_bytes": 184329216
  },
  "verification": {
    "counts_match": true,
    "chain_verify": "true"
  },
  "notes_for_pr2b": "..."
}

counts_match: true AND chain_verify: "true" is the two-element verification contract. If either is missing, the drill is not yet acceptable for the Wave 0 §8 close.

When the drill fails partway

pg_dump or pg_restore can fail halfway — disk full, network drop, target schema collision. The script's set -euo pipefail exits non-zero on the first failure, but leaves the partial artefacts behind. Safe-retry pattern:

bash
# 1. Inspect the failure mode:
ls -la docs/audits/restore-drill-<ts>/
cat docs/audits/restore-drill-<ts>/dump.log     # if pg_dump failed
cat docs/audits/restore-drill-<ts>/restore.log  # if pg_restore failed

# 2. Drop the target DB to discard any half-restored state:
docker exec docker_postgres_1 \
    psql -U panorama -d postgres \
    -c "DROP DATABASE panorama_drill_target"
docker exec docker_postgres_1 \
    psql -U panorama -d postgres \
    -c "CREATE DATABASE panorama_drill_target"

# 3. Discard the partial drill artefacts (the dump file may be
#    truncated; the report.json may be missing):
rm -rf docs/audits/restore-drill-<ts>/

# 4. Re-run the drill — it'll write to a fresh timestamped
#    out-dir (or pass --out-dir to a specific path).
scripts/restore-drill.sh ...

If pg_restore succeeded but the bootstrap + rls.sql re-apply step failed (see reapply-migrations.log), the target DB has tables + data but no grants. Symptoms: panorama_app cannot SELECT post- restore. Fix:

bash
# Manually re-run the bootstrap + rls.sql loop:
psql "$DST_URL" -v ON_ERROR_STOP=1 \
    -f apps/core-api/prisma/supabase-bootstrap.sql
for rls in apps/core-api/prisma/migrations/*/rls.sql; do
    psql "$DST_URL" -v ON_ERROR_STOP=1 -f "$rls"
done

Then re-run the drill with --out-dir pointing at a fresh path — the count/chain-verify portion will succeed against the now-grant'd target.

RTO / RPO observation

RTO

The drill measures three components:

  1. dump_seconds — how long pg_dump takes against the source. Scales linearly with database size; expect ~1 second per 10 MB on a healthy connection.
  2. restore_seconds — how long pg_restore takes against the target. Typically 2-5× the dump duration (index rebuild dominates).
  3. verify_seconds — how long the chain-verify CLI takes to walk every audit row. Scales linearly with audit_events row count.

rto_seconds_total = dump + restore + verify.

The drill RTO is NOT the published RTO. A real incident adds operator-side overhead the drill cannot capture:

PhaseRealistic durationNotes
Provisioning a fresh Supabase project~10 minutesFree tier; manual via dashboard
Restoring the Supabase backup snapshot~5-15 minutesPer Supabase docs; scales with DB size
Drill script RTO (rto_seconds_total)~dump+restore+verifyCaptured by the script
Updating Fly secrets + rolling deploy~3-5 minutesPer secrets-rotation.md §DATABASE_URL
DNS cutover (if applicable)~5 minutes + TTLOnly if new project has different hostname
Smoke-test the restored app surface~5 minutesLogin, sample query, photo fetch

Worked example. For a ~10 MB staging dump, drill RTO is ~30 seconds (~3s dump + ~5s restore + ~22s verify with a populated chain). Add the operator overhead: 10 min provisioning + 10 min Supabase backup restore + 0.5 min drill + 5 min secrets + 5 min smoke-test = ~30 minutes published RTO.

For a public-preview Community deployment, 30 minutes is the working RTO claim. PR2b records both the script number and the operator-observed wall-clock so the gap between them is visible. At first real-customer scale the published RTO becomes business-driven (do paying tenants accept 30 min?); the gap from script-RTO to wall-clock RTO is where the Enterprise managed runbook earns its keep.

RPO

The drill does NOT measure RPO. RPO is the maximum tolerable data loss in time and depends on the source's backup cadence:

  • Supabase free tier: daily snapshots, retained 7 days. RPO = up to 24 hours of writes lost.
  • Supabase Pro tier: point-in-time recovery (PITR) within the retention window. RPO = seconds (PITR resolution).
  • Self-host with manual pg_basebackup: whatever cadence the operator runs. The runbook does not pre-suppose one.

The drill confirms the restore mechanic works given a snapshot; it cannot confirm the snapshot cadence is acceptable. That's a separate operator decision per docs/en/self-hosting.md §"Backup policy".

For the Community hosted instance during the public-preview phase, the working RPO is 24 hours (Supabase free-tier daily). When a paying tenant lands, the hosted instance moves to Pro and RPO drops to 5 minutes (Supabase Pro PITR resolution). This trade-off is explicit in docs/audits/HANDOFF-2026-05-09-session-end.md §4 — daily-only RPO is acceptable for free preview.

Recording the drill artefact

After the drill completes, commit the docs/audits/restore-drill-<date>/ directory. The directory contains:

  • report.json — drill metadata (RTO components, verification outcome). Machine-readable; the source of truth for the numbers.
  • source-baselines.json / target-baselines.json — pre/post counts; reproducible by re-running the captured queries.
  • drift.json — per-key match against source.
  • chain-verify.json — chain-verify CLI JSON output. On chain-verify OK this is a 1-line summary (verified count, legacy count, mismatches=0); safe to commit. On chain-verify FAIL the JSON contains firstMismatch with row id + tenantId
    • action — treat this case as confidential and redact tenantId / action before sharing in a public follow-up issue.
  • chain-verify.txt — human-readable chain-verify output. Same PII trade-off as chain-verify.json.
  • dump.log / restore.log / reapply-migrations.log — pg_dump
    • pg_restore + bootstrap stderr. May contain row values on error. Excluded from git via .gitignore (see repo root .gitignore for the docs/audits/restore-drill-*/ block).
  • source.dump — the dump file ITSELF. Full DB export including audit metadata, tenant data, rotation events at rest. Excluded from git via .gitignore; the drill script does NOT delete the dump on success because you may want to spot-check it. Delete or move to encrypted storage manually once the drill report is committed.

Add a one-page README.md to the drill directory documenting:

  • Source DB: which database (staging? local dev?)
  • Target DB: where you restored to (local sibling? Supabase throwaway project?)
  • Trigger: scheduled hygiene / pre-upgrade / post-incident / Wave 0 §8 close
  • Observed wall-clock RTO: how long did the FULL operation take from "I'm going to drill now" to "report.json written"? This includes the operator overhead the drill script cannot capture.
  • Anything surprising: did pg_restore emit any unexpected warnings? Did chain-verify report any legacy rows? Did the drift.json show any unexpected non-match?
  • Follow-ups filed: file GitHub issues for any drift or warning surfaced; link them here. Tag restore-drill-followup.

Restoring object storage

The drill covers the DB layer only. Photos and tenant-export tarballs live in S3 / R2; restoring them is a separate procedure:

  • S3 / R2 default retention. Cloudflare R2 retains deleted objects for 30 days by default; AWS S3 with lifecycle rules varies. The drill does NOT exercise object-storage restore — the test cost is non-trivial (potentially a full bucket clone). If the operator's threat model includes object-storage ransomware or accidental bucket-wide delete, document the bucket recovery procedure in docs/runbooks/object-storage-restore.md (future runbook; not yet shipping).
  • Tenant-export object lifecycle. Tenant export tarballs are TTL'd (per ADR-0020 §8 — 24h job window + S3 lifecycle rule after expiresAt). The drill does NOT preserve these; a restored DB row referencing a TTL'd S3 object will surface as a 404 on download. This is correct behaviour: exports past TTL are intentionally unrecoverable.

Refuse-prod safety guard

The drill script will refuse to run with --src-url or --dst-url that matches the regex (panorama-prod|panorama-hosted|panorama\.app|panorama-fleet|prod\.supabase\.co).

This is a typo-catcher, not a security boundary. An adversarial operator can rename their prod environment to evade. The point is to make the obvious miss-paste impossible at 5am.

If you genuinely need to drill against a production replica (e.g., a forensic restore into a different region for investigation), edit scripts/restore-drill.sh's refuse_prod() function and document the exception inline. Do not pass a hidden flag — the edit creates an audit trail in git log.

What this runbook does NOT cover

  • Live failover to the restored DB. Pointing production traffic at the restored target is DATABASE_URL / DNS / Fly secret work covered by secrets-rotation.md §DATABASE_URL — not by this runbook.
  • Multi-region disaster recovery with active replicas. Out of scope for Community; Enterprise edition with multi-region read-replica architecture handles this.
  • Backup creation cadence and retention policy. The drill exercises an already-taken backup; setting the backup schedule is operator-side at the Supabase project level. See docs/en/self-hosting.md §"Backup policy" for the self-host recommendation.
  • Forensic preservation of a tampered DB. If the source has a confirmed tamper (chain-verify FAIL), the drill is the wrong tool — preserve the source first, then restore from a pre-tamper backup. See incident.md Phase 5 §"Recover" for the decision tree.
  • Object-storage restore. Photos and tenant-export tarballs in S3 / R2 are NOT part of the drill — see §"Restoring object storage" above for the boundary. A real DR scenario where the DB is restored but photos are gone leaves the runtime in a half-state (asset rows reference 404'd S3 objects). Document this in the post-mortem if it bites in production.
  • Self-host customisation of refuse-prod heuristic. The default regex matches the maintainer's hosted-instance naming (panorama-prod, prod.supabase.co, etc.). Self-hosters running production at a different hostname must export PANORAMA_DRILL_REFUSE_REGEX with their own prod-host pattern, OR rely on operational discipline of separate .env files. The unmodified script has zero refuse-prod gate on self-host naming conventions.

Drill cadence enforcement

A GitHub Actions cron-driven restore-drill-due issue opener is a Round 7 follow-up to enforce the quarterly cadence (sibling to the secrets-rotation-due cron mentioned in secrets-rotation.md §"Rotation hygiene cadence"). Until it lands, the operator's .calendar is the enforcement mechanism. Tag a follow-up issue restore-drill-due when you create a manual calendar reminder so the cron, once it ships, can pick up where the calendar left off.

Multi-tenant orchestration

Drilling a restore across a fleet of hosted-tenant instances with per-tenant evidence capture is a managed-service concern and will ship in the Enterprise edition (forward-looking — Panorama is pre-revenue and Community-only today). See the feature matrix Backups row for the Community-vs-Enterprise positioning. The single-DB drill procedure above is the Community surface and the self-hoster contract.


If you got here from incident.md

You're in Phase 5 (Recover). After the drill produces a clean report:

  1. Point the runtime at the restored target — see secrets-rotation.md §"DATABASE_URL" for the Fly secrets + rolling-deploy sequence.
  2. Notify affected tenants per incident.md if you escalated from Phase 4 — restore from backup is a confirmation-trigger event under most P0 conditions.
  3. Continue Phase 5 step 3 (post-mortem within 7 days) in incident.md.

AGPL-3.0-or-later · LICENSE