
How We Built Incident Grouping for Dead Man's Switch Monitoring

Matt · 5 min read

The Problem We Kept Hearing About

TL;DR: Incident grouping uses time-window correlation to automatically group monitor failures that occur within the same period into a single incident. Instead of receiving 15 separate alert emails during a database outage, you get one incident notification listing all affected monitors with a shared timeline and auto-generated postmortem.

Once users got past about 30 monitors on DeadPing, the same feedback started coming in: "When something goes wrong, my phone explodes." A database outage would take down 15 background jobs, and they'd get 15 separate alerts. A network blip would cause 20 monitors to miss their pings within the same minute, and the on-call engineer would get 20 notifications that all meant the same thing.

We needed to build incident grouping. But there are a few different ways to approach it, and the tradeoffs matter.

Three Approaches We Considered

Tag-Based Grouping

The first idea was to let users tag monitors (e.g., "database", "us-east-1", "billing-team") and group alerts by tag. If three monitors tagged "database" go down, group them.

The problem: it requires users to set up and maintain tags correctly. In practice, tags drift. New monitors don't get tagged. The grouping silently degrades as the monitor list grows. We wanted something that works without configuration.

ML-Based Clustering

We looked at using historical correlation data to automatically cluster monitors that tend to fail together. It's a cool idea in theory: learn that billing-sync and invoice-generator always fail at the same time because they share a database, then group them automatically.

The problem: you need a lot of failure data to build reliable clusters, and most monitors (hopefully) don't fail that often. Cold-start is terrible. And the model can't handle novel failure modes: a type of outage it hasn't seen before won't match any existing cluster. We filed this under "maybe later, when we have more data."

Time-Window Grouping

The approach we shipped: if multiple monitors go down within a 5-minute window, they're part of the same incident. No configuration required. No training data needed. It works because the underlying insight is simple: correlated failures happen at the same time. A network outage doesn't take down one service now and another one 30 minutes later. Everything fails together.

Is it perfect? No. Two genuinely unrelated failures that happen to occur within 5 minutes of each other will get grouped together. But in practice this is rare, and a single incident with a false positive is still better than a flood of individual alerts.
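The window check itself is tiny. Here's a minimal sketch (function and parameter names are ours for illustration, not DeadPing's actual API; the constant matches the 5-minute window described above):

```python
from datetime import datetime, timedelta

GROUPING_WINDOW = timedelta(minutes=5)

def belongs_to_incident(incident_opened_at: datetime, failure_at: datetime) -> bool:
    """A new failure joins an open incident if it occurs within the
    grouping window of when that incident was opened."""
    return timedelta(0) <= failure_at - incident_opened_at <= GROUPING_WINDOW
```

A failure 2 minutes after the incident opened joins it; a failure 6 minutes after starts a new one.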

How the Flow Works

Here's what happens when a monitor misses its expected ping:

1. Monitor "billing-sync" misses its ping
   → Check: any open incident created in the last 5 minutes?
   → No → Create new incident, add billing-sync, send "incident opened" alert

2. Monitor "invoice-generator" misses its ping (2 minutes later)
   → Check: any open incident created in the last 5 minutes?
   → Yes → Add invoice-generator to existing incident, suppress individual alert

3. Monitor "payment-processor" misses its ping (1 minute later)
   → Same → Added to incident, individual alert suppressed

4. All three monitors receive pings again (recovery)
   → All monitors in incident have recovered
   → Close incident, send "incident resolved" alert with summary

The key decision is what happens at step 2: instead of sending a new alert, we attach the monitor to the existing incident and suppress the individual notification. The on-call engineer already knows something is wrong from the first alert. The additional monitors are tracked in the incident detail view but don't generate noise.
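The whole lifecycle above can be sketched in a few dozen lines. This is not DeadPing's actual implementation, just an illustration under the simplifying assumption of a single open incident at a time, as in the flow above; all names are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

@dataclass
class Incident:
    opened_at: datetime
    monitors: set = field(default_factory=set)
    recovered: set = field(default_factory=set)
    closed: bool = False

open_incident = None  # assumption: at most one open incident at a time

def on_missed_ping(monitor: str, now: datetime) -> str:
    """Returns the notification action for a missed ping."""
    global open_incident
    inc = open_incident
    if inc and not inc.closed and now - inc.opened_at <= WINDOW:
        inc.monitors.add(monitor)
        return "suppressed"          # steps 2-3: attach, no new alert
    open_incident = Incident(opened_at=now, monitors={monitor})
    return "incident_opened"         # step 1: the one alert that goes out

def on_recovered_ping(monitor: str) -> str:
    """Returns the notification action for a recovery."""
    global open_incident
    inc = open_incident
    if inc is None or inc.closed or monitor not in inc.monitors:
        return "not_in_incident"
    inc.recovered.add(monitor)
    if inc.recovered == inc.monitors:
        inc.closed = True
        return "incident_resolved"   # step 4: one summary alert
    return "recovery_tracked"        # partial recovery, no alert yet
```

Running the billing-sync / invoice-generator scenario through this produces exactly two alert-worthy events: "incident_opened" on the first failure and "incident_resolved" once everything is back.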

The Notification Difference

Without incident grouping, a database outage affecting 15 monitors produces 30 notifications: 15 "monitor down" and 15 "monitor recovered." With incident grouping, you get 2: one "incident opened" and one "incident resolved."

# Without grouping: 30 notifications
[DOWN] billing-sync missed expected ping
[DOWN] invoice-generator missed expected ping
[DOWN] payment-processor missed expected ping
...12 more DOWN alerts...
[UP] billing-sync is back up
[UP] invoice-generator is back up
...12 more UP alerts...

# With grouping: 2 notifications
[INCIDENT] 15 monitors down – billing-sync, invoice-generator, +13 more
[RESOLVED] Incident resolved after 8 minutes – 15/15 monitors recovered

The resolved notification includes a link to the incident detail page, which has everything you need for a postmortem without digging through Slack history.
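Building that "+N more" subject line is a small string-folding exercise. A sketch (our own helper, not DeadPing's code):

```python
def incident_subject(monitors: list, preview: int = 2) -> str:
    """Builds the 'incident opened' subject: name the first few
    monitors, fold the rest into a '+N more' suffix."""
    shown = ", ".join(monitors[:preview])
    extra = len(monitors) - preview
    suffix = f", +{extra} more" if extra > 0 else ""
    return f"[INCIDENT] {len(monitors)} monitors down – {shown}{suffix}"
```

With 15 monitors starting at billing-sync and invoice-generator, this yields the subject shown above.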

Auto-Generated Postmortems

Every incident automatically gets a summary page with three things: total duration, the list of affected monitors, and a timeline showing exactly when each monitor went down and when it recovered. That's the data you'd normally spend 30 minutes reconstructing from alert timestamps after the fact.

On top of the auto-generated data, we added two editable fields: root cause and action items. These are free-text fields where the on-call engineer can document what happened and what needs to change. They're optional, but having them right next to the timeline means the postmortem gets written while the incident is fresh, not three days later when half the details are forgotten.

Incident #INC-2026-0042
Duration: 8 minutes (03:14 – 03:22 UTC)
Affected monitors: 15

Timeline:
  03:14  billing-sync         DOWN
  03:14  invoice-generator    DOWN
  03:15  payment-processor    DOWN
  03:15  subscription-renewal DOWN
  ...11 more...
  03:22  All monitors         RECOVERED

Root cause: (editable)
  Database connection pool exhausted after deploy removed
  connection timeout setting.

Action items: (editable)
  - Restore connection pool timeout in production config
  - Add connection pool utilization to Grafana dashboard
  - Set up database connection count alert at 80% capacity
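The timeline itself is just the incident's raw (timestamp, monitor, state) events rendered in order. A minimal rendering sketch, with hypothetical names and column widths padded to the longest monitor name:

```python
from datetime import datetime

def render_timeline(events: list) -> str:
    """Renders (timestamp, monitor, state) tuples into the
    aligned timeline shown on the incident page."""
    width = max(len(monitor) for _, monitor, _ in events)
    return "\n".join(
        f"  {ts:%H:%M}  {monitor:<{width}} {state}"
        for ts, monitor, state in events
    )
```

Because the events are recorded as monitors fail and recover, the postmortem timeline costs nothing to produce; it's already there when the incident closes.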

Why Business Tier Only

Incident grouping is a Business tier feature ($39/mo). The reasoning is straightforward: if you're running enough monitors that alert fatigue is a problem, you're running serious infrastructure. The teams that need this are running 50+ monitors across multiple services, and the cost of a single mishandled incident (the engineer time, the extended downtime, the postmortem scramble) dwarfs the subscription cost.

Free and Pro tiers still get individual alerts for every monitor. That's the right default for smaller setups where you want to know about every failure independently. Incident grouping is for when "every failure independently" becomes noise instead of signal.

Check out the pricing page for the full tier comparison, or read our post on why alert fatigue matters for more on the problem this solves.

Start monitoring in 60 seconds

Free forever for up to 20 monitors. No credit card required.

Get Started Free