MAJOR · ops · RC-DB-001 · v1.0
TIER 3 · SPECIFIC APP · BOX 3 · P1 — CRITICAL

Initial Exposure: T3C(~)R(~)
Typical at open — 3 teams engaged, confidence and recovery path not yet assessed.
Update every 15 minutes. Publish on bridge and in incident chat.
| Role | Who | Required | First Action |
| --- | --- | --- | --- |
| MIM / IC | Certified Major Incident Manager | Required | Open incident, assign command, declare Box 3 |
| Operations Chief | Senior DBA or Infrastructure Lead | Required | Triage DB connectivity, confirm scope |
| App SME | Application owner(s) for affected services | Required | Confirm app-layer impact, check retry behavior |
| Network/Infra | Network or cloud infrastructure engineer | Required | Check network path, DNS, firewall rules |
| PIO / Comms | Customer Communications Lead | Required | Draft initial customer alert, update status page |
| Security / SO | InfoSec or Compliance Lead | If data exposure risk | Assess data breach risk, advise on recovery path safety |
| Vendor / LO | Cloud provider or DB vendor contact | If cloud/vendor DB | Open P1 support ticket, get on vendor bridge |
Establish scope first:

- **Which database(s)?** Primary, replica, both? Specific host?
- **When did it go unreachable?** First alert time vs. confirmed outage time.
- **What can connect and what can't?** App servers? Read replicas? Admin access?
- **Any recent changes?** Deployments, schema migrations, infra changes, cert rotations?
- **Error message or log snippet?** Connection timeout? Auth failure? Port unreachable?
- **Customer impact confirmed?** Which features / customers / regions?
- **Is this a cloud-managed DB?** RDS, Cloud SQL, Atlas, Azure SQL, etc.?
Initial actions:

- **MIM opens incident.** Severity: P1. Alarm Level: Box 3. Assign command.
- **Ping DB host directly.** Network vs. DB process — distinguish quickly (sketch below).
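A quick way to separate the two: ICMP ping tests the network path, while a TCP connect to the database port tests whether anything is listening. A minimal sketch in Python, assuming a hypothetical host `db-primary.internal` and PostgreSQL's default port 5432; substitute your real endpoint:

```python
import socket
import subprocess

HOST = "db-primary.internal"  # hypothetical; substitute the real DB host
PORT = 5432                   # PostgreSQL default; adjust for your engine

# ICMP ping exercises the network path (note: some networks drop ICMP).
ping = subprocess.run(["ping", "-c", "3", "-W", "2", HOST], capture_output=True)
print("network path:", "OK" if ping.returncode == 0 else "FAILED")

# TCP connect tells you whether a process is actually listening on the port.
try:
    with socket.create_connection((HOST, PORT), timeout=3):
        print("db port: OPEN")
except OSError as exc:
    print(f"db port: UNREACHABLE ({exc})")
```

Ping succeeding while the port refuses points at the DB process or a host firewall; both failing points at the network path or DNS.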
- **Check DB process status.** Is the DB process up? Has it crashed or hung? (See the sketch below.)
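On a self-managed Linux host, `systemctl` and PostgreSQL's `pg_isready` answer this in seconds; for a managed service, check the provider console instead. A sketch assuming a systemd unit named `postgresql` (unit names vary by distro and version):

```python
import subprocess

# Service state: "active", "failed", "inactive" (unit name is an assumption).
state = subprocess.run(["systemctl", "is-active", "postgresql"],
                       capture_output=True, text=True)
print("service state:", state.stdout.strip())

# pg_isready reports whether the server accepts connections (PostgreSQL only).
ready = subprocess.run(["pg_isready", "-h", "localhost", "-p", "5432"],
                       capture_output=True, text=True)
print(ready.stdout.strip())
```

A unit that reports `active` while `pg_isready` times out suggests a hung process rather than a crashed one.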
- **Check connection pool exhaustion.** Max connections hit? Zombie connections? (See the sketch below.)
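If an admin session is still possible, compare live connections against the configured ceiling. A PostgreSQL sketch using `psycopg2`; the DSN is a placeholder, and other engines expose equivalent views:

```python
import psycopg2

# Placeholder DSN. Connect as a superuser: PostgreSQL reserves slots for
# superusers (superuser_reserved_connections) even when the pool is full.
conn = psycopg2.connect("host=db-primary.internal dbname=postgres user=admin")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    ceiling = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    cur.execute("SELECT count(*) FROM pg_stat_activity "
                "WHERE state = 'idle in transaction';")
    zombies = cur.fetchone()[0]
print(f"{in_use}/{ceiling} connections; {zombies} idle in transaction")
```

A count at or near the ceiling with many `idle in transaction` sessions usually indicates an app-side leak, which the App SME can cross-check against retry behavior.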
- **Review DB and system logs.** OOM kill, disk full, auth errors, deadlock? (See the sketch below.)
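Disk full and the OOM killer are the two most common silent causes. A sketch for a self-managed Linux host; the data directory path is a placeholder, and the `journalctl` call assumes systemd:

```python
import shutil
import subprocess

# Check free space on the data volume, not just the root filesystem.
usage = shutil.disk_usage("/var/lib/postgresql")  # placeholder data directory
print(f"data volume free: {usage.free / usage.total:.1%}")

# The OOM killer writes to the kernel log; scan the last two hours.
kernel = subprocess.run(["journalctl", "-k", "--since", "-2h"],
                        capture_output=True, text=True)
oom = [line for line in kernel.stdout.splitlines()
       if "Out of memory" in line or "oom-kill" in line]
print(f"OOM events in the last 2h: {len(oom)}")
```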
- **Open vendor P1 ticket if cloud DB.** Do not wait — open in parallel with triage.
- **MIM posts first milestone.** T+15 minutes. CAN format. Stakeholder view updated.
Then confirm:

- **DB admin access confirmed?** If not — who has it? Get them on the bridge now.
- **Failover runbook located?** Where is it? Who owns it? Is it current?
- **Leadership aware?** E2 posture if customer-facing. Exec SMS trigger.
- **Backup availability confirmed?** Last backup time? Restore tested? RTO/RPO known? (See the sketch below.)
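On a cloud-managed instance, the snapshot list answers the first two questions immediately. A sketch assuming AWS RDS via `boto3`; the instance identifier is a placeholder:

```python
import boto3

rds = boto3.client("rds")
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier="prod-db-1")["DBSnapshots"]  # placeholder identifier
available = [s for s in snaps if s["Status"] == "available"]
if available:
    latest = max(available, key=lambda s: s["SnapshotCreateTime"])
    print("latest restorable snapshot:", latest["SnapshotCreateTime"])
else:
    print("no completed snapshots found")
```

For point-in-time recovery, `describe_db_instances` also reports `LatestRestorableTime`, which is usually a tighter RPO than the last snapshot.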
- **Security / compliance notified?** If data may be compromised or inaccessible to regulators.
- **Read replica available?** Can traffic fail over to a read replica temporarily? (See the sketch below.)
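Before pointing traffic at a replica, confirm it is reachable and reasonably current. On PostgreSQL, replay lag is queryable on the replica itself; a sketch with a placeholder host:

```python
import psycopg2

# Placeholder replica host: run this against the replica, not the primary.
conn = psycopg2.connect("host=db-replica.internal dbname=postgres user=admin")
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery(), "
                "now() - pg_last_xact_replay_timestamp();")
    in_recovery, lag = cur.fetchone()
print(f"in recovery: {in_recovery}, replay lag: {lag}")
```

Note that a read replica only absorbs reads; sending writes there means promoting it, which on most platforms cannot be undone.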
- **MIM: update exposure line.** Every 15 min. If R(U) → escalate immediately.
Escalation path: App SME → DB / Ops Chief → MIM → Engineering Director → CTO / Exec

Trigger escalation up the chain when: recovery path is unknown (R(U)) for >15 minutes, vendor has not engaged within 30 minutes, or data integrity risk is identified.
| Time | Milestone | Criteria |
| --- | --- | --- |
| T+5 | Bridge Open / Command Assigned | MIM active. All required responders on call or paged. |
| T+10 | Conditions Gathered | All teams have reported CAN to MIM. Scope confirmed. |
| T+15 | First Milestone Published | Stakeholders updated. Exposure notation published. |
| T+20 | Recovery Track Opened | Primary recovery path identified. Assignee and timebox set. |
| T+30 | Customer Alert Decision | PIO issues customer alert or MIM documents why not. |
Responders are not released until MIM confirms all of the following:

- ✓ Database is confirmed reachable from all application servers (see the fan-out sketch below)
- ✓ Application teams have validated their services are operating normally
- ✓ Monitoring alerts have cleared or been acknowledged with explanation
- ✓ Customer impact has been assessed — alert issued or documented as not triggered
- ✓ Root cause known or formally deferred to the Learning Review
- ✓ MIM has posted the resolution milestone with duration and known cause
- ✓ After Action scheduled (within 5 business days for a P1)
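The first criterion above can be spot-checked mechanically by fanning out from the bridge and testing the DB port from every app server. A sketch assuming SSH access and bash on the remote hosts; hostnames are placeholders, and `/dev/tcp` is a bash builtin:

```python
import subprocess

APP_SERVERS = ["app-1.internal", "app-2.internal"]  # placeholder inventory
DB_HOST, DB_PORT = "db-primary.internal", 5432      # placeholder DB endpoint

for server in APP_SERVERS:
    # bash's /dev/tcp pseudo-device attempts a TCP connect; timeout caps it.
    probe = subprocess.run(
        ["ssh", server,
         f"timeout 3 bash -c 'exec 3<>/dev/tcp/{DB_HOST}/{DB_PORT}'"],
        capture_output=True)
    ok = probe.returncode == 0
    print(f"{server} -> {DB_HOST}:{DB_PORT}: {'OK' if ok else 'FAILED'}")
```

Application-level validation (the second criterion) still rests with the app teams; a TCP connect only proves reachability.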