Alert Fatigue Is a Risk, Not an Annoyance

When monitoring becomes noise, you lose signal without losing the bill. The hidden cost of tolerating a cluttered alert channel.

Alert fatigue gets discussed like it's a morale problem. "The team's getting tired of the noise." "People are muting channels." Annoying, sure. Something to fix when there's time.

It isn't a morale problem. It's a systemic risk. A monitoring system that produces alerts nobody acts on is indistinguishable from no monitoring at all — but more expensive, and with a false sense of coverage.

This is the part that gets missed. You don't stop paying for monitoring when people stop trusting the alerts. You just stop getting the benefit.

The term "alarm fatigue" originated in clinical medicine, where the stakes made the dynamic impossible to ignore. A 2013 paper in AACN Advanced Critical Care documented that hospital monitors can produce over 350 alarms per patient per day, of which only about 0.5% indicate life-threatening events — and that patient deaths have been directly attributed to staff desensitisation[1]. The operations version of the same phenomenon is less acute but structurally identical.

The quiet failure mode

Here's how it actually unfolds at a small business:

Monitoring is implemented. Alerts go to a Slack channel. For the first month, people watch the channel and respond to everything. It works.

Over the next six months, the alert volume creeps up. New services, new thresholds, a few retrospective "we should have alerted on that" additions after incidents. The channel goes from a few messages a day to dozens.

Somewhere around month four, the team's mental model shifts. Instead of "every alert means something," it becomes "most alerts don't need action." People start scanning rather than reading.

By month seven, a real alert fires during a minor incident. It's buried under twelve noisy ones from the same hour. Nobody notices for three hours. Customers call.

The incident is a reminder, and there's a short burst of cleanup activity. But within a few months, alert volume has drifted back up.

This isn't a hypothetical. This is the default trajectory of alerting at any organisation that doesn't actively fight it.

Why it's more than annoyance

Treating alert fatigue as a comfort issue underestimates what's being lost:

Coverage erosion. The alerts exist, but humans have stopped acting on them promptly. Your real mean time to recovery (MTTR) is now worse than your monitoring coverage suggests.

Skill atrophy. People who triage dozens of non-incidents per day lose the pattern-matching they need for real incidents. When something genuinely weird happens, they've been trained to assume it's noise.

Recruitment damage. Engineers who've been burned by bad alerting recognise it in interviews. Ask any senior ops hire what they want in a job and "sensible on-call" is usually in the top three.

Audit problems. If you ever have to walk someone through how you detect and respond to incidents — an enterprise prospect, a compliance auditor, an investor — "we get alerts but they're mostly ignored" is a difficult thing to say. If you don't say it, and they audit your process, they'll find it anyway.

Real-incident cost. The incident where a critical alert was missed in the noise will cost more than the cumulative effort of keeping the channel clean would have.

The difficulty of fixing it

The hard part of alert hygiene is that noise always looks defensible in isolation.

Every alert was added for a reason. Someone, at some point, thought "we really should know when this happens." Often there's a post-mortem that added it. Removing it feels like reopening the wound.

The framing that helps: an alert that isn't acted on isn't protecting you. If the last 20 fires of an alert produced no action, the alert is already not protecting you — you've just been paying the attention tax on it anyway.
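The 20-fire rule of thumb can be checked mechanically. What follows is a minimal sketch, assuming each alert's fire history can be exported as records with a timestamp and an "acted on" flag. The `Fire` record and the `is_unprotective` function are hypothetical names, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Fire:
    """One firing of an alert (hypothetical record shape)."""
    fired_at: datetime
    acted_on: bool  # True if someone acknowledged, investigated, or fixed


def is_unprotective(fires: list[Fire], window: int = 20) -> bool:
    """Return True if the most recent `window` fires all produced no action.

    Alerts with fewer than `window` fires are not judged, because there
    is not enough history to apply the rule.
    """
    if len(fires) < window:
        return False
    recent = sorted(fires, key=lambda f: f.fired_at)[-window:]
    return not any(f.acted_on for f in recent)


# Example: 25 hourly fires, none of which led to any action.
start = datetime(2024, 1, 1)
ignored = [Fire(start + timedelta(hours=i), acted_on=False) for i in range(25)]
print(is_unprotective(ignored))  # True
```

The window of 20 is the figure from the paragraph above. A team with low alert volume might reasonably pick a smaller number.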

This is close in spirit to the 3 AM alert problem, which covers the acute version of the same pathology. Alert fatigue is the chronic version: less dramatic, equally damaging, and slower to show up.

A practical audit process

A quarterly alerting review, done seriously, takes about two hours. It looks like this:

1. List every active alert
   Pull the full inventory of everything that can fire. Not just the loud ones: everything.

2. Check fire history
   For each alert, how many times has it fired in the last 90 days? Most monitoring tools track this.

3. Check action history
   For each fire, was an action taken? Did someone acknowledge, investigate, or fix?

4. Categorise
   Per alert: "fires and is acted on," "fires and isn't acted on," "rarely fires but matters."

5. Cut ruthlessly
   The "fires and isn't acted on" category is noise. Delete or consolidate. This is the point of the exercise.

6. Document the survivors
   Every alert that remains gets a short runbook: what it means, first check, who owns it.

Done on a quarterly cadence, this keeps the channel usable. Skipped for two years, it stops being fixable without an incident-driven reset.
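The fire-history, action-history, and categorise steps lend themselves to a small script once the data is exported. This is a minimal sketch, not a prescribed implementation: the `AlertHistory` shape, the example alert names, and both cutoffs (`rare_max` and `min_action_ratio`) are assumptions for illustration. Whether a rarely firing alert actually matters still needs a human decision.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    ACTED_ON = "fires and is acted on"
    NOT_ACTED_ON = "fires and isn't acted on"
    RARE = "rarely fires but matters"  # "matters" still needs a human call


@dataclass
class AlertHistory:
    """90-day summary for one alert (hypothetical export format)."""
    name: str
    fires: int    # times the alert fired in the last 90 days
    actions: int  # fires that led to an ack, investigation, or fix


def categorise(alert: AlertHistory, rare_max: int = 2,
               min_action_ratio: float = 0.5) -> Category:
    """Place one alert in an audit category.

    rare_max and min_action_ratio are illustrative cutoffs only.
    """
    if alert.fires == 0 or alert.fires <= rare_max:
        return Category.RARE
    if alert.actions / alert.fires >= min_action_ratio:
        return Category.ACTED_ON
    return Category.NOT_ACTED_ON


# Made-up inventory for the example.
inventory = [
    AlertHistory("db-unreachable", fires=1, actions=1),
    AlertHistory("disk-80-percent", fires=140, actions=3),
    AlertHistory("cert-expiring", fires=6, actions=6),
]

for alert in inventory:
    print(f"{alert.name}: {categorise(alert).value}")

# Everything in the "isn't acted on" category is a candidate for cutting.
cut_candidates = [a.name for a in inventory
                  if categorise(a) is Category.NOT_ACTED_ON]
print("Cut or consolidate:", cut_candidates)
```

The output of a script like this is a starting list for the review meeting, not a verdict. The two-hour session is where someone decides what gets deleted and what gets a runbook.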

The counterintuitive move: fewer alerts

The instinct when something slips through is to add more monitoring. This is almost always wrong. The failure wasn't insufficient coverage. It was insufficient signal.

A small, tight alert list with near-zero false positives is more protective than a sprawling one where 30% of what fires is noise. This is "what breaks first when nobody's looking" territory: the silent failures that get past busy people on busy teams.

Google's SRE Workbook formalises this as alerting on SLOs rather than on individual component metrics: the goal is to page humans only when the user-facing service-level objective is genuinely at risk, not every time a component hiccups[2]. The principle holds at SMB scale even without formal SLOs: an alert should correspond to a real, user-affecting problem, not to a threshold somebody guessed at once.
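For teams that do have an SLO, the Workbook chapter cited above frames the paging decision in terms of burn rate: how fast the error budget is being spent, relative to the rate that would exactly exhaust it over the SLO window. Below is a minimal sketch of that arithmetic. The 99.9% target and the request counts are example values, the threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour, and the function names are hypothetical.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only when the budget is burning at or above the threshold."""
    return burn_rate(errors, requests, slo_target) >= threshold


# One hour of traffic, 10,000 requests, against a 99.9% SLO.
print(should_page(errors=5, requests=10_000))    # False (burn rate about 0.5)
print(should_page(errors=200, requests=10_000))  # True (burn rate about 20)
```

A burn rate of 1 means the budget would last exactly the full window. Five errors in 10,000 requests may trip a naive threshold alert, but it is well inside what a 99.9% target tolerates.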

The other counterintuitive move: aggressive consolidation. Five alerts on different aspects of the same underlying failure are often better replaced by one alert on the failure itself. "Database is unreachable" is more useful than separate alerts for connection pool, replication lag, query latency, and CPU — all of which will flare when the database goes down.
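One way to approach consolidation without deleting anything on day one is to suppress component alerts while the alert for the underlying failure is firing. This is a softer variant of replacing them outright, sketched here under stated assumptions: the alert names and the `COVERS` mapping are made up for the example, and `consolidate` is a hypothetical helper rather than a feature of any specific tool.

```python
# Hypothetical mapping from a root-cause alert to the symptom alerts it explains.
COVERS = {
    "database-unreachable": {
        "connection-pool-exhausted",
        "replication-lag-high",
        "query-latency-high",
        "db-cpu-high",
    },
}


def consolidate(firing: set[str],
                covers: dict[str, set[str]] = COVERS) -> set[str]:
    """Drop symptom alerts whose root-cause alert is also firing."""
    suppressed: set[str] = set()
    for root, symptoms in covers.items():
        if root in firing:
            suppressed |= symptoms
    return firing - suppressed


firing_now = {
    "database-unreachable",
    "connection-pool-exhausted",
    "query-latency-high",
    "disk-space-low",
}
print(sorted(consolidate(firing_now)))
# ['database-unreachable', 'disk-space-low']
```

Symptom alerts still reach the channel when the root alert is not firing, so a slow query on an otherwise healthy database is not hidden. If a symptom alert never fires on its own, that is a hint it can be removed entirely.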

The relationship to team size

Smaller teams are more exposed to alert fatigue, not less. With a large on-call rotation, the noise from a noisy channel is spread out, and each person sees maybe a tenth of it. With a team of two, every alert goes to one of two humans, every time.

The pillar article on boring IT for SMBs covers the full alerting picture for small teams. The short version for alert fatigue specifically: at small scale, the cost of noise is disproportionately high. You have less capacity to absorb it, and every false alert is individually louder.

The test

The quickest way to tell if you have an alert-fatigue problem:

Open your alerting channel. Count the unread messages, or the alerts from the last 24 hours. If the number is bigger than you can comfortably triage in ten minutes — and you find yourself scrolling past most of them — the channel has already failed at its job.
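The counting half of this test is easy to automate if the channel's message timestamps can be exported. This is a rough sketch: the 30 seconds per alert is an assumed average triage time, not a measured figure, and `channel_has_failed` is a hypothetical name. The "scrolling past most of them" half is a behaviour check that no script can do for you.

```python
from datetime import datetime, timedelta


def channel_has_failed(alert_times: list[datetime], now: datetime,
                       seconds_per_alert: float = 30.0,
                       budget_minutes: float = 10.0) -> bool:
    """True if the last 24 hours of alerts cannot be triaged within the budget.

    seconds_per_alert is an assumed average; measure your own.
    """
    cutoff = now - timedelta(hours=24)
    recent = [t for t in alert_times if cutoff <= t <= now]
    return len(recent) * seconds_per_alert > budget_minutes * 60


now = datetime(2024, 6, 1, 9, 0)
quiet = [now - timedelta(hours=h) for h in range(1, 9)]          # 8 alerts
noisy = [now - timedelta(minutes=10 * m) for m in range(1, 61)]  # 60 alerts
print(channel_has_failed(quiet, now))  # False
print(channel_has_failed(noisy, now))  # True
```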

The monitoring's still there. It's still expensive. But it isn't doing what you thought it was doing, and neither is anyone looking at it. Consolidating alerts into a single tiered channel — which is part of what Site Watcher does across domain, SSL, uptime, DNS, and vendor signals — is one way to keep the list small enough to stay honest.

References

  1. Sendelbach, S. & Funk, M. (2013). Alarm fatigue: a patient safety concern. AACN Advanced Critical Care, 24(4), 378–386. pubmed.ncbi.nlm.nih.gov/24153215.
  2. Google, Site Reliability Engineering Workbook, Chapter 5: Alerting on SLOs. sre.google/workbook/alerting-on-slos.