ArchCode

Incident Runbook Template

A structured format your team can fill in for each known failure mode, plus a "first 30 minutes" playbook for on-call responders.


Copy the templates below into your team wiki, Notion, Confluence, or a Markdown file in your repo. Fill in the bracketed fields for each failure mode your team has seen before. Aim for 3–5 runbooks to start, since the failure modes you have already seen are the ones that will matter during an incident.

Template 1 — First 30 Minutes Incident Playbook

This is the shared starting point for any incident, regardless of the specific failure. Paste this at the top of your incident doc or Slack thread.

# First 30 Minutes: Incident Playbook

## 0–5 minutes: Acknowledge and communicate
- [ ] Post in #incidents: "Investigating [brief description]. Owner: [your name]."
- [ ] Confirm the scope: is this affecting all users, some users, or internal only?
- [ ] Check the status page and update it if the incident is user-facing

## 5–15 minutes: Diagnose
- [ ] Check the error dashboard: spike in 5xx errors? Latency increase? Availability drop?
- [ ] Check logs for the affected service: look for repeated error messages
- [ ] Check recent deploys: anything shipped in the last 2 hours?
- [ ] Check infrastructure events: restarts, scaling events, upstream issues?

## 15–25 minutes: Decide to fix forward or roll back
- [ ] If the cause is known and fixable in under 10 minutes: fix forward
- [ ] If the cause is unknown or the fix will take more than 10 minutes: roll back
- [ ] Rollback procedure: [link to rollback runbook]

## 25–30 minutes: Communicate and stabilise
- [ ] Post an update in #incidents with current status and expected resolution
- [ ] Confirm whether the incident is resolved or still active
- [ ] Assign post-mortem owner

## After resolution
- [ ] Write a short post-mortem (5 minutes, not a blame doc)
- [ ] Add/update the relevant runbook with what you learned
- [ ] Fix the underlying cause if rollback was used
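If your team wants to encode the playbook's two numeric rules (the 2-hour deploy window and the 10-minute fix-forward limit) in tooling, the following is a minimal sketch of how that might look. The function names, the `(name, deployed_at)` tuple shape, and the choice to treat a fix estimate of exactly 10 minutes as a rollback are assumptions for illustration; the playbook itself does not specify them.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

# Thresholds taken from the playbook text.
FIX_FORWARD_LIMIT_MINUTES = 10
RECENT_DEPLOY_WINDOW = timedelta(hours=2)


def recent_deploys(
    deploys: List[Tuple[str, datetime]], now: datetime
) -> List[str]:
    """Return names of deploys shipped within the last 2 hours.

    `deploys` is a hypothetical list of (name, deployed_at) pairs; adapt it
    to whatever your deploy tooling actually exposes.
    """
    cutoff = now - RECENT_DEPLOY_WINDOW
    return [name for name, deployed_at in deploys if deployed_at >= cutoff]


def decide_action(cause_known: bool, estimated_fix_minutes: Optional[float]) -> str:
    """Apply the 15-25 minute rule: fix forward or roll back.

    Fix forward only when the cause is known AND the fix is estimated at
    under 10 minutes. Everything else, including a missing estimate or an
    estimate of exactly 10 minutes (an assumption), rolls back.
    """
    if (
        cause_known
        and estimated_fix_minutes is not None
        and estimated_fix_minutes < FIX_FORWARD_LIMIT_MINUTES
    ):
        return "fix_forward"
    return "rollback"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    deploys = [
        ("api", now - timedelta(minutes=30)),
        ("web", now - timedelta(hours=3)),
    ]
    print("Deploys to inspect:", recent_deploys(deploys, now))
    print("Decision:", decide_action(cause_known=True, estimated_fix_minutes=5))
```

Defaulting to rollback whenever the inputs are uncertain mirrors the playbook's bias: an unknown cause is treated the same as a slow fix.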

Template 2 — Single Failure Mode Runbook

Create one of these per known failure mode. Keep them short: the goal is to cut diagnosis time, not to write documentation for its own sake.

# Runbook: [Failure Mode Name]
## e.g., "Database connection pool exhausted" or "API latency spike"

Last updated: [date]
Owner: [team or person]

## Symptoms
- [What does it look like in the dashboard?]
- [What error messages appear in logs?]
- [What do users report?]

## Likely causes
1. [Most common cause]
2. [Second most common cause]
3. [Less common cause]

## Diagnosis steps
1. Check [dashboard / metric] for [specific indicator]
2. Run: `[command or query]`
3. Look for: [what to look for in output]

## Fix steps

### Option A: Quick fix (under 5 minutes)
1. [Step 1]
2. [Step 2]

### Option B: If quick fix does not work
1. [Rollback or escalation step]

## Verification
- [ ] [Metric] is back to normal range
- [ ] No new errors in logs
- [ ] Users confirmed unaffected

## Prevention
- [What change would prevent this recurring?]
- [Linked ticket or PR if fix is in progress]
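A half-filled runbook is easy to miss until someone needs it. As one possible safeguard, the sketch below checks a runbook's Markdown for the six sections named in the template and for bracketed fields that were never filled in. The function name `lint_runbook`, the message wording, and the rule that checkboxes (`[ ]`, `[x]`) and Markdown links (`[text](url)`) are not placeholders are all assumptions made for this example.

```python
import re

# Section headings taken from the runbook template.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Likely causes",
    "Diagnosis steps",
    "Fix steps",
    "Verification",
    "Prevention",
]

# Level-2 Markdown headings only ("## Title"), not "#" or "###".
HEADING = re.compile(r"^##[ \t]+(.+)$", re.MULTILINE)

# A bracketed field that is neither a checkbox ("[ ]", "[x]") nor the
# text part of a Markdown link ("[text](url)").
PLACEHOLDER = re.compile(r"\[(?![ xX]\])([^\]\n]+)\](?!\()")


def lint_runbook(markdown: str) -> list[str]:
    """Return a list of human-readable problems; empty means the runbook passes."""
    problems = []

    headings = {m.group(1).strip() for m in HEADING.finditer(markdown)}
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")

    for lineno, line in enumerate(markdown.splitlines(), start=1):
        for m in PLACEHOLDER.finditer(line):
            problems.append(f"line {lineno}: unfilled field [{m.group(1)}]")

    return problems


if __name__ == "__main__":
    draft = "## Symptoms\n- [What do users report?]\n"
    for problem in lint_runbook(draft):
        print(problem)
```

Run against the blank template, this reports every bracketed field, which makes it a quick way to see how much of a runbook is still to be written. It could be wired into a pre-commit hook or CI step if your runbooks live in a repo.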
