← All templates
Copy the templates below into your team wiki, Notion, Confluence, or a Markdown file in your repo. Fill in the bracketed fields for each failure mode your team has seen before. Aim for 3–5 runbooks to start — these are the ones that will matter during an incident.
Template 1 — First 30 Minutes Incident Playbook
This is the shared starting point for any incident, regardless of the specific failure. Paste this at the top of your incident doc or Slack thread.
# First 30 Minutes — Incident Playbook
- [ ] Post in #incidents: "Investigating [brief description]. Owner: [your name]."
- [ ] Confirm the scope: is this affecting all users, some users, or internal only?
- [ ] Check the status page and update if user-facing
- [ ] Check the error dashboard: spike in 5xx errors? Latency increase? Availability drop?
- [ ] Check logs for the affected service: look for repeated error messages
- [ ] Check recent deploys: anything shipped in the last 2 hours?
- [ ] Check infrastructure events: restarts, scaling events, upstream issues?
- [ ] If cause is known and fixable in under 10 minutes: fix forward
- [ ] If cause is unknown or fix will take more than 10 minutes: rollback
- [ ] Rollback procedure: [link to rollback runbook]
- [ ] Post an update in #incidents with current status and expected resolution
- [ ] Confirm the incident is resolved or still active
- [ ] Assign post-mortem owner
- [ ] Write a short post-mortem (5 minutes, not a blame doc)
- [ ] Add/update the relevant runbook with what you learned
- [ ] Fix the underlying cause if rollback was used
Template 2 — Single Failure Mode Runbook
Create one of these per known failure mode. Keep them short — the goal is to cut diagnosis time, not write documentation for its own sake.
# Runbook: [Failure Mode Name]
Last updated: [date]
Owner: [team or person]
## Symptoms
- [What does it look like in the dashboard?]
- [What error messages appear in logs?]
- [What do users report?]
## Likely causes
1. [Most common cause]
2. [Second most common cause]
3. [Less common cause]
## Diagnosis steps
1. Check [dashboard / metric] for [specific indicator]
2. Run: `[command or query]`
3. Look for: [what to look for in output]
## Fix steps
1. [Step 1]
2. [Step 2]
1. [Rollback or escalation step]
## Verification
- [ ] [Metric] is back to normal range
- [ ] No new errors in logs
- [ ] Users confirmed unaffected
## Prevention
- [What change would prevent this recurring?]
- [Linked ticket or PR if fix is in progress]
Want a review of your current incident readiness?
The Free Delivery Review covers observability, alerting, and whether your team has the runbooks and dashboards they need to diagnose incidents fast.
Get a Free Delivery Review