MAJOR · ops · RC-DB-001 · v1.0
TIER 3 · SPECIFIC APP · BOX 3 · P1 — CRITICAL

Initial Exposure: T3C(~)R(~)
Typical at open — 3 teams engaged, confidence and recovery path not yet assessed.
Update every 15 minutes. Publish on bridge and in incident chat.
| Role | Who | Required | First Action |
| --- | --- | --- | --- |
| MIM / IC | Certified Major Incident Manager | Required | Open incident, assign command, declare Box 3 |
| Operations Chief | Senior DBA or Infrastructure Lead | Required | Triage DB connectivity, confirm scope |
| App SME | Application owner(s) for affected services | Required | Confirm app-layer impact, check retry behavior |
| Network/Infra | Network or cloud infrastructure engineer | Required | Check network path, DNS, firewall rules |
| PIO / Comms | Customer Communications Lead | Required | Draft initial customer alert, update status page |
| Security / SO | InfoSec or Compliance Lead | If data exposure risk | Assess data breach risk, advise on recovery path safety |
| Vendor / LO | Cloud provider or DB vendor contact | If cloud/vendor DB | Open P1 support ticket, get on vendor bridge |
Establish scope first:

- **Which database(s)?** Primary, replica, both? Specific host?
- **When did it go unreachable?** First alert time vs. confirmed outage time.
- **What can connect and what can't?** App servers? Read replicas? Admin access?
- **Any recent changes?** Deployments, schema migrations, infra changes, cert rotations?
- **Error message or log snippet?** Connection timeout? Auth failure? Port unreachable?
- **Customer impact confirmed?** Which features / customers / regions?
- **Is this a cloud-managed DB?** RDS, Cloud SQL, Atlas, Azure SQL, etc.?
Initial actions:

- **MIM opens incident.** Severity: P1. Alarm Level: Box 3. Assign command.
- **Ping DB host directly.** Network vs. DB process — distinguish quickly (sketch below).
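A quick way to separate the two: ICMP ping tests the network path, while a TCP connect to the database port tests whether anything is listening. A minimal sketch in Python, assuming a hypothetical host `db-primary.internal` and PostgreSQL's default port 5432; substitute your real endpoint:

```python
import socket
import subprocess

HOST = "db-primary.internal"  # hypothetical; substitute the real DB host
PORT = 5432                   # PostgreSQL default; adjust for your engine

# ICMP ping exercises the network path (note: some networks drop ICMP).
ping = subprocess.run(["ping", "-c", "3", "-W", "2", HOST], capture_output=True)
print("network path:", "OK" if ping.returncode == 0 else "FAILED")

# TCP connect tells you whether a process is actually listening on the port.
try:
    with socket.create_connection((HOST, PORT), timeout=3):
        print("db port: OPEN")
except OSError as exc:
    print(f"db port: UNREACHABLE ({exc})")
```

Ping succeeding while the port refuses points at the DB process or a host firewall; both failing points at the network path or DNS.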
- **Check DB process status.** Is the DB process up? Has it crashed or hung? (See the sketch below.)
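On a self-managed Linux host, `systemctl` and PostgreSQL's `pg_isready` answer this in seconds; for a managed service, check the provider console instead. A sketch assuming a systemd unit named `postgresql` (unit names vary by distro and version):

```python
import subprocess

# Service state: "active", "failed", "inactive" (unit name is an assumption).
state = subprocess.run(["systemctl", "is-active", "postgresql"],
                       capture_output=True, text=True)
print("service state:", state.stdout.strip())

# pg_isready reports whether the server accepts connections (PostgreSQL only).
ready = subprocess.run(["pg_isready", "-h", "localhost", "-p", "5432"],
                       capture_output=True, text=True)
print(ready.stdout.strip())
```

A unit that reports `active` while `pg_isready` times out suggests a hung process rather than a crashed one.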
- **Check connection pool exhaustion.** Max connections hit? Zombie connections? (See the sketch below.)
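If an admin session is still possible, compare live connections against the configured ceiling. A PostgreSQL sketch using `psycopg2`; the DSN is a placeholder, and other engines expose equivalent views:

```python
import psycopg2

# Placeholder DSN. Connect as a superuser: PostgreSQL reserves slots for
# superusers (superuser_reserved_connections) even when the pool is full.
conn = psycopg2.connect("host=db-primary.internal dbname=postgres user=admin")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    ceiling = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    cur.execute("SELECT count(*) FROM pg_stat_activity "
                "WHERE state = 'idle in transaction';")
    zombies = cur.fetchone()[0]
print(f"{in_use}/{ceiling} connections; {zombies} idle in transaction")
```

A count at or near the ceiling with many `idle in transaction` sessions usually indicates an app-side leak, which the App SME can cross-check against retry behavior.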
- **Review DB and system logs.** OOM kill, disk full, auth errors, deadlock? (See the sketch below.)
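Disk full and the OOM killer are the two most common silent causes. A sketch for a self-managed Linux host; the data directory path is a placeholder, and the `journalctl` call assumes systemd:

```python
import shutil
import subprocess

# Check free space on the data volume, not just the root filesystem.
usage = shutil.disk_usage("/var/lib/postgresql")  # placeholder data directory
print(f"data volume free: {usage.free / usage.total:.1%}")

# The OOM killer writes to the kernel log; scan the last two hours.
kernel = subprocess.run(["journalctl", "-k", "--since", "-2h"],
                        capture_output=True, text=True)
oom = [line for line in kernel.stdout.splitlines()
       if "Out of memory" in line or "oom-kill" in line]
print(f"OOM events in the last 2h: {len(oom)}")
```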
- **Open vendor P1 ticket if cloud DB.** Do not wait — open in parallel with triage.
- **MIM posts first milestone.** T+15 minutes. CAN format. Stakeholder view updated.
Then confirm:

- **DB admin access confirmed?** If not — who has it? Get them on the bridge now.
- **Failover runbook located?** Where is it? Who owns it? Is it current?
- **Leadership aware?** E2 posture if customer-facing. Exec SMS trigger.
- **Backup availability confirmed?** Last backup time? Restore tested? RTO/RPO known? (See the sketch below.)
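On a cloud-managed instance, the snapshot list answers the first two questions immediately. A sketch assuming AWS RDS via `boto3`; the instance identifier is a placeholder:

```python
import boto3

rds = boto3.client("rds")
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier="prod-db-1")["DBSnapshots"]  # placeholder identifier
available = [s for s in snaps if s["Status"] == "available"]
if available:
    latest = max(available, key=lambda s: s["SnapshotCreateTime"])
    print("latest restorable snapshot:", latest["SnapshotCreateTime"])
else:
    print("no completed snapshots found")
```

For point-in-time recovery, `describe_db_instances` also reports `LatestRestorableTime`, which is usually a tighter RPO than the last snapshot.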
- **Security / compliance notified?** If data may be compromised or inaccessible to regulators.
- **Read replica available?** Can traffic fail over to a read replica temporarily? (See the sketch below.)
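Before pointing traffic at a replica, confirm it is reachable and reasonably current. On PostgreSQL, replay lag is queryable on the replica itself; a sketch with a placeholder host:

```python
import psycopg2

# Placeholder replica host: run this against the replica, not the primary.
conn = psycopg2.connect("host=db-replica.internal dbname=postgres user=admin")
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery(), "
                "now() - pg_last_xact_replay_timestamp();")
    in_recovery, lag = cur.fetchone()
print(f"in recovery: {in_recovery}, replay lag: {lag}")
```

Note that a read replica only absorbs reads; sending writes there means promoting it, which on most platforms cannot be undone.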
- **MIM: update exposure line.** Every 15 min. If R(U) → escalate immediately.
Escalation path: App SME → DB / Ops Chief → MIM → Engineering Director → CTO / Exec

Trigger escalation up the chain when: recovery path is unknown (R(U)) for >15 minutes, vendor has not engaged within 30 minutes, or data integrity risk is identified.
| Time | Milestone | Criteria |
| --- | --- | --- |
| T+5 | Bridge Open / Command Assigned | MIM active. All required responders on call or paged. |
| T+10 | Conditions Gathered | All teams have reported CAN to MIM. Scope confirmed. |
| T+15 | First Milestone Published | Stakeholders updated. Exposure notation published. |
| T+20 | Recovery Track Opened | Primary recovery path identified. Assignee and timebox set. |
| T+30 | Customer Alert Decision | PIO issues customer alert or MIM documents why not. |
Responders are not released until MIM confirms all of the following:

- ✓ Database is confirmed reachable from all application servers (see the fan-out sketch below)
- ✓ Application teams have validated their services are operating normally
- ✓ Monitoring alerts have cleared or been acknowledged with explanation
- ✓ Customer impact has been assessed — alert issued or documented as not triggered
- ✓ Root cause known or formally deferred to the Learning Review
- ✓ MIM has posted the resolution milestone with duration and known cause
- ✓ After Action scheduled (within 5 business days for a P1)
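The first criterion above can be spot-checked mechanically by fanning out from the bridge and testing the DB port from every app server. A sketch assuming SSH access and bash on the remote hosts; hostnames are placeholders, and `/dev/tcp` is a bash builtin:

```python
import subprocess

APP_SERVERS = ["app-1.internal", "app-2.internal"]  # placeholder inventory
DB_HOST, DB_PORT = "db-primary.internal", 5432      # placeholder DB endpoint

for server in APP_SERVERS:
    # bash's /dev/tcp pseudo-device attempts a TCP connect; timeout caps it.
    probe = subprocess.run(
        ["ssh", server,
         f"timeout 3 bash -c 'exec 3<>/dev/tcp/{DB_HOST}/{DB_PORT}'"],
        capture_output=True)
    ok = probe.returncode == 0
    print(f"{server} -> {DB_HOST}:{DB_PORT}: {'OK' if ok else 'FAILED'}")
```

Application-level validation (the second criterion) still rests with the app teams; a TCP connect only proves reachability.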