Faster incident diagnosis with dashboards, alerts, and runbooks
How a UK SaaS team cut mean time to diagnose from 3 hours to 18 minutes and reduced customer-reported incidents by 60%.
The problem
- Incidents took 2–4 hours to diagnose because logs were unstructured and scattered
- Alerts were either missing entirely or so noisy that engineers muted them
- No shared runbook — every engineer investigated incidents differently
- Customers often reported incidents before the team detected them internally
What we delivered
- Structured logging added to the application with a consistent field schema
- 3 core dashboards: errors, latency, and availability — visible to the whole team
- Alert rules tuned to real user impact — false positives reduced by ~75%
- 5 runbooks for the most common failure modes, plus a "first 30 minutes" incident playbook
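To illustrate what structured logging with a consistent field schema can look like, here is a minimal sketch using only the Python standard library. The field names (`service`, `request_id`, `route`, `status`, `duration_ms`) and the logger name are hypothetical and chosen for illustration. They are not the schema used in this engagement.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Hypothetical schema: these field names are illustrative assumptions,
# not the schema used in the case study.
REQUIRED_FIELDS = ("service", "request_id", "route", "status", "duration_ms")


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same keys."""

    def format(self, record):
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        # Missing ones are emitted as null so the schema stays stable.
        for field in REQUIRED_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(stream):
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("app.structured")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.error(
        "checkout failed",
        extra={
            "service": "api",
            "request_id": "req-123",
            "route": "/checkout",
            "status": 500,
            "duration_ms": 120,
        },
    )
    print(buffer.getvalue(), end="")
```

Because every line carries the same keys, a dashboard or log query can filter on `status` or group by `route` without parsing free-form text, which is the property that makes scattered logs searchable during an incident.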
Results
- Mean time to diagnose: 3 hrs → 18 min
- Alert noise reduced: ~75% fewer alerts
- Customer-reported incidents: dropped by 60%
How it worked
- Signals audit: reviewed existing logs, metrics, and alerts, and identified the biggest diagnosis blockers
- Define done: agreed which failure modes needed runbooks and what "useful alert" meant for this team
- Implement: structured logging first, then dashboards, then alert tuning, then runbooks
- Handoff: a walkthrough session, after which the team owns the dashboards and alert thresholds
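One way to picture an alert tuned to real user impact is a rule that fires only when the error rate stays high across several consecutive windows and there is enough traffic for the rate to be meaningful. The sketch below is a hedged illustration of that idea. The `WindowStats` type, the `should_alert` function, and every threshold value are assumptions made for this example, not the rules or numbers agreed with the team.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request and error counts for one fixed-length time window."""
    requests: int
    errors: int


def breaches(window, error_rate_threshold, min_requests):
    """True when the window has enough traffic and too many errors."""
    if window.requests <= 0 or window.requests < min_requests:
        # Too little traffic: a handful of errors is not a reliable signal.
        return False
    return window.errors / window.requests > error_rate_threshold


def should_alert(
    windows,
    error_rate_threshold=0.05,  # assumed value: 5% of requests failing
    min_requests=100,           # assumed value: ignore very quiet windows
    sustained_windows=3,        # assumed value: require 3 windows in a row
):
    """Fire only when the most recent windows all breach the threshold."""
    if len(windows) < sustained_windows:
        return False
    recent = windows[-sustained_windows:]
    return all(
        breaches(w, error_rate_threshold, min_requests) for w in recent
    )
```

Requiring a sustained breach trades a few minutes of detection delay for far fewer pages on short spikes and low-traffic blips, which is the kind of trade-off a team settles when it defines what a useful alert means.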
Want a similar result?
Start with a free signals audit of your current observability setup. We will tell you what's missing and send you a clear fix plan.