ArchCode
Case Study · Observability & Incident Readiness

Faster incident diagnosis with dashboards, alerts, and runbooks

How a UK SaaS team cut mean time to diagnose from 3 hours to 18 minutes and reduced customer-reported incidents by 60%.

Client: UK SaaS team (anonymised)

Team size: 6–20 engineers

Industry: Multi-tenant platform

Package: Observability & Incident Readiness

Timeline: 2.5 weeks

The problem

  • Incidents took 2–4 hours to diagnose because logs were unstructured and scattered
  • Alerts were either missing entirely or too noisy to act on, so engineers muted them
  • There was no shared runbook, so every engineer investigated incidents differently
  • Customers often reported incidents before the team detected them internally

What we delivered

  • Structured logging added to the application, with a consistent field schema
  • Three core dashboards (errors, latency, and availability), visible to the whole team
  • Alert rules tuned to real user impact, which reduced false positives by ~75%
  • 5 runbooks for the most common failure modes, plus a "first 30 minutes" incident playbook
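
To make "structured logging with a consistent field schema" concrete, here is a minimal sketch using only Python's standard library. It is an illustration rather than the client's implementation: the field names (service, tenant_id, request_id) and the helper names JsonFormatter and build_logger are assumptions chosen for the example.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical schema: these field names are illustrative assumptions,
# not the client's actual schema.
REQUIRED_FIELDS = ("service", "tenant_id", "request_id")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Always emit every schema field (None if absent), so log queries
        # and dashboards never have to handle a missing key.
        for field in REQUIRED_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    # Context is passed via `extra`, which attaches the keys to the record.
    log.error(
        "payment provider timeout",
        extra={"service": "checkout", "tenant_id": "t-123", "request_id": "req-456"},
    )
```

Because every line carries the same keys, an engineer can filter by tenant or follow a single request across services without writing a custom parser for each log source.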

Results

Mean time to diagnose: 3 hrs → 18 min

Alert noise: ~75% fewer false positives

Customer-reported incidents: down 60%

How it worked

  • Signals audit: reviewed existing logs, metrics, and alerts, and identified the biggest diagnosis blockers
  • Define Done: agreed which failure modes needed runbooks and what a "useful alert" meant for this team
  • Implement: structured logging first, then dashboards, then alert tuning, then runbooks
  • Handoff: ran a walkthrough session; the team now owns the dashboards and alert thresholds
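
One way to pin down what a "useful alert" means is to write the rule as a small, testable function. The sketch below is a hypothetical example, not the rule set delivered in this engagement: the Window type, the should_alert name, and all three thresholds are assumptions. It fires only when the error rate stays high across several consecutive windows that each carry enough traffic to reflect real user impact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one evaluation window (for example, one minute)."""
    total_requests: int
    failed_requests: int


def should_alert(
    windows: list[Window],
    error_rate_threshold: float = 0.05,  # assumed: 5% of requests failing
    min_requests: int = 100,             # assumed: skip low-traffic windows
    sustained_windows: int = 3,          # assumed: must persist for 3 windows
) -> bool:
    """Fire only when the most recent windows all show real user impact."""
    if len(windows) < sustained_windows:
        return False
    for w in windows[-sustained_windows:]:
        if w.total_requests == 0 or w.total_requests < min_requests:
            return False  # too little traffic to judge user impact
        if w.failed_requests / w.total_requests < error_rate_threshold:
            return False  # error rate was acceptable in this window
    return True


if __name__ == "__main__":
    blip = [Window(1000, 2), Window(1000, 90), Window(1000, 3)]
    outage = [Window(1000, 80), Window(1000, 120), Window(1000, 95)]
    print(should_alert(blip))    # False: a single bad window does not page anyone
    print(should_alert(outage))  # True: errors persisted for three windows
```

Requiring both a minimum request count and a sustained breach is a common way to reduce false positives: a brief spike or a handful of failed requests on a quiet tenant no longer wakes anyone up.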

Want a similar result?

Start with a free signals audit of your current observability setup. We will tell you what's missing and send you a clear fix plan.