Restore drill runbook
Status. Round 6 PR2 of the Wave 0 plan — closes the §8 docs contract that the executable companion
scripts/restore-drill.shautomates. PR2a (this PR) ships the script + this runbook. The first executed drill against staging — with observed RTO/RPO numbers — is PR2b; the maintainer runs the drill and updates the §RTO/RPO observation section here with the artefact link.
This page tells the maintainer (and any future on-call) how to exercise Panorama's restore path before they need it for real. The drill answers two questions:
- Can we actually restore? Backups that aren't periodically exercised are aspirational. The drill walks dump → restore → verify end-to-end and produces a structured report.
- How long does it take? RTO (recovery time objective) and RPO (recovery point objective) are honest only when measured. The drill captures RTO components; the runbook explains how to translate them into the published commitment.
The drill is the executable counterpart of the policy documents:
incident.md— Phase 5 §"Recover" invokes this runbook when restore is the chosen recovery path.secrets-rotation.md— the cross-cutting §"Integration with the restore drill" section now links here.verify-audit-chain.md— the drill's chain-verify step exercises that runbook's CLI against the restored DB; a drill that passes is also a chain-integrity spot check.secrets-inventory.md— the drill consumesDATABASE_PRIVILEGED_URLfor the source DB; the inventory documents that secret class.
When to drill
| Trigger | Cadence |
|---|---|
| Wave 0 acceptance §8 closing | Once, against staging, by the maintainer (PR2b) |
| Scheduled hygiene | Quarterly drill against staging; capture artefacts under docs/audits/restore-drill-<ISODate>/ |
| Post-incident | Within 7 days of any P0 or P1 closure where backups were referenced in the contain or recover phase |
| Pre-major-upgrade | Before a Postgres major-version upgrade (e.g., 16 → 17) or a Supabase project migration |
| After any source-schema change that touches the audit chain | The drill is the canary for migration 0021 + the SECURITY DEFINER triggers; a chain that no longer verifies post-restore is a release blocker |
The drill is not a substitute for the chain-verify CLI running weekly in CI (per verify-audit-chain.md §"How to run"). The CLI is the daily-driver chain check; the drill also exercises pg_dump / pg_restore / RLS-policy reconstitution.
Pre-flight
Before invoking the drill, gather:
Source DB URL — the database to drill from. Two acceptable sources:
- Staging —
apps/core-api/.env.stagingis generated byscripts/setup-staging-env.shand contains the URL. Never source from production — the drill script will refuse a URL containingprodheuristics (see §Refuse-prod safety guard). - Local dev-stack —
postgres://panorama:panorama@localhost:5432/panorama?sslmode=disable. Useful for smoke-testing the script itself; numbers from a local dev DB are NOT the real RTO. Staging is the source for the published number.
- Staging —
Target DB URL — a SEPARATE database the drill restores into. The drill refuses to restore into the source itself. Two shapes:
- Local sibling DB — fastest path. Create on the dev-stack:bashURL:
docker exec docker_postgres_1 \ psql -U panorama -d postgres \ -c "CREATE DATABASE panorama_drill_target"postgres://panorama:panorama@localhost:5432/panorama_drill_target?sslmode=disable. Drop after the drill to keep the dev-stack tidy:bashdocker exec docker_postgres_1 \ psql -U panorama -d postgres \ -c "DROP DATABASE panorama_drill_target" - Throwaway Supabase project — true end-to-end shape for a real RTO claim. Spin up a free-tier project at supabase.com/dashboard; populate
apps/core-api/.env.drillvia a one-off run ofsetup-staging-env.sh; pass the direct URL via--dst-url. Cost: the free-tier project ships with one week of idle-pause; remember to delete it after the drill.
- Local sibling DB — fastest path. Create on the dev-stack:
Tooling —
pg_dump,pg_restore,psql,jq,pnpm. The drill script refuses to run without any of these. On Ubuntu / Debian:bashapt install postgresql-client-16 jqDATABASE_PRIVILEGED_URLfor the TARGET DB — chain-verify reads it from env via the wrapping drill script; you do not set it manually.Out-dir — where to write the drill artefacts. Default is
docs/audits/restore-drill-<UTC-ISO-timestamp>/. Override via--out-dirif you want the report under a custom path (e.g., per-incident:docs/audits/restore-drill-post-incident-#241/).
Running the drill
The canonical invocation:
scripts/restore-drill.sh \
--src-url "$SRC_URL" \
--dst-url "$DST_URL"Append --force-truncate-dst if the target has existing tables you want the drill to drop. (DROP SCHEMA public CASCADE; CREATE SCHEMA public; — destructive on the TARGET only, never the source.) The drill requires the target DB name to contain drill or test as an extra safety check; override with PANORAMA_DRILL_ALLOW_ANY_TARGET=1 if you've audited the target manually. Append --skip-chain-verify only if you're smoke-testing the dump/restore mechanics and don't have a populated audit chain yet.
Example — local sibling DB, drilling from staging:
set -a; source apps/core-api/.env.staging; set +a
scripts/restore-drill.sh \
--src-url "$DATABASE_PRIVILEGED_URL" \
--dst-url "postgres://panorama:panorama@localhost:5432/panorama_drill_target?sslmode=disable"What it does in order:
- Probe both URLs (verifies reachability + Postgres version match loosely).
- Refuse-prod safety check on both URLs (heuristic — catches the sleep-deprived miss-paste, NOT an adversarial input).
- Capture source-side baseline counts (migrations, audit_events, tenants, users, memberships, RLS policies, audit_events tail hash). Written to
source-baselines.json. pg_dumpsource in custom format →source.dump. Records wall-clock duration as dump RTO component.pg_restoreinto target. Records wall-clock duration as restore RTO component.- Capture target-side baseline counts. Written to
target-baselines.json. - Diff source vs target baselines. Written to
drift.json— every key is{ src, dst, match }. The drill expectsmatch: truefor every key. - Run
pnpm chain-verifyagainst the restored target DB. Records wall-clock duration as verify RTO component. Three chain states:OK— every non-legacy row verifies; the chain survived dump/restore byte-exact.INCONCLUSIVE— source had an empty audit_events table. Realistic only when drilling against a freshly-reset local dev DB. Re-run against staging for a real result.FAIL— at least one row's selfHash does not match its recomputed digest. This is a tamper signal, not a restore-mechanic failure: the dump captured a chain that was already broken at the source, or the restore corrupted thedigestPreImagecolumn. Investigate perverify-audit-chain.md.
- Emit
report.jsonwith the RTO components + verification outcome. Print a human-readable summary to stdout.
The drill exit code:
| Exit | Meaning |
|---|---|
| 0 | Drill passed: counts match AND chain-verify is OK (or skipped) |
| 1 | Drill failed: count mismatch OR chain-verify reports tamper |
| 2 | Operational error: missing tooling, refuse-prod triggered, pg_dump or pg_restore failed |
| 3 | Drill INCONCLUSIVE: counts match but chain-verify reported empty source — drill mechanic worked, but a freshly-reset dev DB has nothing to verify. Not the same as PASS. Re-run against a populated source for a real result. |
What "successful" looks like
A passing drill against a populated source emits a report.json shaped like:
{
"drill_executed_at": "2026-05-17T19:30:00Z",
"rto_seconds_total": 47,
"components": {
"dump_seconds": 12,
"restore_seconds": 22,
"verify_seconds": 13,
"dump_bytes": 184329216
},
"verification": {
"counts_match": true,
"chain_verify": "true"
},
"notes_for_pr2b": "..."
}counts_match: true AND chain_verify: "true" is the two-element verification contract. If either is missing, the drill is not yet acceptable for the Wave 0 §8 close.
When the drill fails partway
pg_dump or pg_restore can fail halfway — disk full, network drop, target schema collision. The script's set -euo pipefail exits non-zero on the first failure, but leaves the partial artefacts behind. Safe-retry pattern:
# 1. Inspect the failure mode:
ls -la docs/audits/restore-drill-<ts>/
cat docs/audits/restore-drill-<ts>/dump.log # if pg_dump failed
cat docs/audits/restore-drill-<ts>/restore.log # if pg_restore failed
# 2. Drop the target DB to discard any half-restored state:
docker exec docker_postgres_1 \
psql -U panorama -d postgres \
-c "DROP DATABASE panorama_drill_target"
docker exec docker_postgres_1 \
psql -U panorama -d postgres \
-c "CREATE DATABASE panorama_drill_target"
# 3. Discard the partial drill artefacts (the dump file may be
# truncated; the report.json may be missing):
rm -rf docs/audits/restore-drill-<ts>/
# 4. Re-run the drill — it'll write to a fresh timestamped
# out-dir (or pass --out-dir to a specific path).
scripts/restore-drill.sh ...If pg_restore succeeded but the bootstrap + rls.sql re-apply step failed (see reapply-migrations.log), the target DB has tables + data but no grants. Symptoms: panorama_app cannot SELECT post- restore. Fix:
# Manually re-run the bootstrap + rls.sql loop:
psql "$DST_URL" -v ON_ERROR_STOP=1 \
-f apps/core-api/prisma/supabase-bootstrap.sql
for rls in apps/core-api/prisma/migrations/*/rls.sql; do
psql "$DST_URL" -v ON_ERROR_STOP=1 -f "$rls"
doneThen re-run the drill with --out-dir pointing at a fresh path — the count/chain-verify portion will succeed against the now-grant'd target.
RTO / RPO observation
RTO
The drill measures three components:
dump_seconds— how longpg_dumptakes against the source. Scales linearly with database size; expect ~1 second per 10 MB on a healthy connection.restore_seconds— how longpg_restoretakes against the target. Typically 2-5× the dump duration (index rebuild dominates).verify_seconds— how long the chain-verify CLI takes to walk every audit row. Scales linearly with audit_events row count.
rto_seconds_total = dump + restore + verify.
The drill RTO is NOT the published RTO. A real incident adds operator-side overhead the drill cannot capture:
| Phase | Realistic duration | Notes |
|---|---|---|
| Provisioning a fresh Supabase project | ~10 minutes | Free tier; manual via dashboard |
| Restoring the Supabase backup snapshot | ~5-15 minutes | Per Supabase docs; scales with DB size |
Drill script RTO (rto_seconds_total) | ~dump+restore+verify | Captured by the script |
| Updating Fly secrets + rolling deploy | ~3-5 minutes | Per secrets-rotation.md §DATABASE_URL |
| DNS cutover (if applicable) | ~5 minutes + TTL | Only if new project has different hostname |
| Smoke-test the restored app surface | ~5 minutes | Login, sample query, photo fetch |
Worked example. For a ~10 MB staging dump, drill RTO is ~30 seconds (~3s dump + ~5s restore + ~22s verify with a populated chain). Add the operator overhead: 10 min provisioning + 10 min Supabase backup restore + 0.5 min drill + 5 min secrets + 5 min smoke-test = ~30 minutes published RTO.
For a public-preview Community deployment, 30 minutes is the working RTO claim. PR2b records both the script number and the operator-observed wall-clock so the gap between them is visible. At first real-customer scale the published RTO becomes business-driven (do paying tenants accept 30 min?); the gap from script-RTO to wall-clock RTO is where the Enterprise managed runbook earns its keep.
RPO
The drill does NOT measure RPO. RPO is the maximum tolerable data loss in time and depends on the source's backup cadence:
- Supabase free tier: daily snapshots, retained 7 days. RPO = up to 24 hours of writes lost.
- Supabase Pro tier: point-in-time recovery (PITR) within the retention window. RPO = seconds (PITR resolution).
- Self-host with manual
pg_basebackup: whatever cadence the operator runs. The runbook does not pre-suppose one.
The drill confirms the restore mechanic works given a snapshot; it cannot confirm the snapshot cadence is acceptable. That's a separate operator decision per docs/en/self-hosting.md §"Backup policy".
For the Community hosted instance during the public-preview phase, the working RPO is 24 hours (Supabase free-tier daily). When a paying tenant lands, the hosted instance moves to Pro and RPO drops to 5 minutes (Supabase Pro PITR resolution). This trade-off is explicit in docs/audits/HANDOFF-2026-05-09-session-end.md §4 — daily-only RPO is acceptable for free preview.
Recording the drill artefact
After the drill completes, commit the docs/audits/restore-drill-<date>/ directory. The directory contains:
report.json— drill metadata (RTO components, verification outcome). Machine-readable; the source of truth for the numbers.source-baselines.json/target-baselines.json— pre/post counts; reproducible by re-running the captured queries.drift.json— per-key match against source.chain-verify.json— chain-verify CLI JSON output. On chain-verify OK this is a 1-line summary (verified count, legacy count, mismatches=0); safe to commit. On chain-verify FAIL the JSON containsfirstMismatchwith row id + tenantId- action — treat this case as confidential and redact
tenantId/actionbefore sharing in a public follow-up issue.
- action — treat this case as confidential and redact
chain-verify.txt— human-readable chain-verify output. Same PII trade-off aschain-verify.json.dump.log/restore.log/reapply-migrations.log— pg_dump- pg_restore + bootstrap stderr. May contain row values on error. Excluded from git via
.gitignore(see repo root.gitignorefor thedocs/audits/restore-drill-*/block).
- pg_restore + bootstrap stderr. May contain row values on error. Excluded from git via
source.dump— the dump file ITSELF. Full DB export including audit metadata, tenant data, rotation events at rest. Excluded from git via.gitignore; the drill script does NOT delete the dump on success because you may want to spot-check it. Delete or move to encrypted storage manually once the drill report is committed.
Add a one-page README.md to the drill directory documenting:
- Source DB: which database (staging? local dev?)
- Target DB: where you restored to (local sibling? Supabase throwaway project?)
- Trigger: scheduled hygiene / pre-upgrade / post-incident / Wave 0 §8 close
- Observed wall-clock RTO: how long did the FULL operation take from "I'm going to drill now" to "report.json written"? This includes the operator overhead the drill script cannot capture.
- Anything surprising: did pg_restore emit any unexpected warnings? Did chain-verify report any legacy rows? Did the drift.json show any unexpected non-match?
- Follow-ups filed: file GitHub issues for any drift or warning surfaced; link them here. Tag
restore-drill-followup.
Restoring object storage
The drill covers the DB layer only. Photos and tenant-export tarballs live in S3 / R2; restoring them is a separate procedure:
- S3 / R2 default retention. Cloudflare R2 retains deleted objects for 30 days by default; AWS S3 with lifecycle rules varies. The drill does NOT exercise object-storage restore — the test cost is non-trivial (potentially a full bucket clone). If the operator's threat model includes object-storage ransomware or accidental bucket-wide delete, document the bucket recovery procedure in
docs/runbooks/object-storage-restore.md(future runbook; not yet shipping). - Tenant-export object lifecycle. Tenant export tarballs are TTL'd (per ADR-0020 §8 — 24h job window + S3 lifecycle rule after
expiresAt). The drill does NOT preserve these; a restored DB row referencing a TTL'd S3 object will surface as a 404 on download. This is correct behaviour: exports past TTL are intentionally unrecoverable.
Refuse-prod safety guard
The drill script will refuse to run with --src-url or --dst-url that matches the regex (panorama-prod|panorama-hosted|panorama\.app|panorama-fleet|prod\.supabase\.co).
This is a typo-catcher, not a security boundary. An adversarial operator can rename their prod environment to evade. The point is to make the obvious miss-paste impossible at 5am.
If you genuinely need to drill against a production replica (e.g., a forensic restore into a different region for investigation), edit scripts/restore-drill.sh's refuse_prod() function and document the exception inline. Do not pass a hidden flag — the edit creates an audit trail in git log.
What this runbook does NOT cover
- Live failover to the restored DB. Pointing production traffic at the restored target is
DATABASE_URL/ DNS / Fly secret work covered bysecrets-rotation.md§DATABASE_URL — not by this runbook. - Multi-region disaster recovery with active replicas. Out of scope for Community; Enterprise edition with multi-region read-replica architecture handles this.
- Backup creation cadence and retention policy. The drill exercises an already-taken backup; setting the backup schedule is operator-side at the Supabase project level. See
docs/en/self-hosting.md§"Backup policy" for the self-host recommendation. - Forensic preservation of a tampered DB. If the source has a confirmed tamper (chain-verify FAIL), the drill is the wrong tool — preserve the source first, then restore from a pre-tamper backup. See
incident.mdPhase 5 §"Recover" for the decision tree. - Object-storage restore. Photos and tenant-export tarballs in S3 / R2 are NOT part of the drill — see §"Restoring object storage" above for the boundary. A real DR scenario where the DB is restored but photos are gone leaves the runtime in a half-state (asset rows reference 404'd S3 objects). Document this in the post-mortem if it bites in production.
- Self-host customisation of refuse-prod heuristic. The default regex matches the maintainer's hosted-instance naming (
panorama-prod,prod.supabase.co, etc.). Self-hosters running production at a different hostname must exportPANORAMA_DRILL_REFUSE_REGEXwith their own prod-host pattern, OR rely on operational discipline of separate.envfiles. The unmodified script has zero refuse-prod gate on self-host naming conventions.
Drill cadence enforcement
A GitHub Actions cron-driven restore-drill-due issue opener is a Round 7 follow-up to enforce the quarterly cadence (sibling to the secrets-rotation-due cron mentioned in secrets-rotation.md §"Rotation hygiene cadence"). Until it lands, the operator's .calendar is the enforcement mechanism. Tag a follow-up issue restore-drill-due when you create a manual calendar reminder so the cron, once it ships, can pick up where the calendar left off.
Multi-tenant orchestration
Drilling a restore across a fleet of hosted-tenant instances with per-tenant evidence capture is a managed-service concern and will ship in the Enterprise edition (forward-looking — Panorama is pre-revenue and Community-only today). See the feature matrix Backups row for the Community-vs-Enterprise positioning. The single-DB drill procedure above is the Community surface and the self-hoster contract.
If you got here from incident.md
You're in Phase 5 (Recover). After the drill produces a clean report:
- Point the runtime at the restored target — see
secrets-rotation.md§"DATABASE_URL" for the Fly secrets + rolling-deploy sequence. - Notify affected tenants per
incident.mdif you escalated from Phase 4 — restore from backup is a confirmation-trigger event under most P0 conditions. - Continue Phase 5 step 3 (post-mortem within 7 days) in
incident.md.