Chapter 07

Why Did It Break?

Reliability, incidents, on-call, SLAs, blameless postmortems — and why every outage is downstream of a decision you made six months ago.


The day half the internet stopped responding

July 2, 2019. At 13:42 UTC, Cloudflare pushed a routine update to its Web Application Firewall — the rule set that protects a meaningful slice of the internet from bots and attacks. The deploy was unremarkable. They'd shipped hundreds like it.

Seconds later, CPU on every machine running the WAF went to 100%. Globally. Cloudflare's edge started returning 502s for traffic worldwide. For about 27 minutes, sites that sat behind Cloudflare returned error pages instead of responses.

The proximate cause was a single regex in the new rule set. It had what engineers call catastrophic backtracking: on certain inputs, the matching algorithm's work grows so fast with input length that it effectively never finishes. The regex pinned the CPU. The WAF couldn't process any other traffic. The deploy was the trigger you can name in one sentence.
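To make the blow-up concrete, here is a toy sketch. It does not use Cloudflare's actual rule or regex engine. It simulates how a naive backtracking matcher would handle the textbook pattern `(a+)+$` against a run of `a` characters followed by one character that can never match, and it counts the attempts. The function name and the pattern are illustrative assumptions.

```python
def backtracking_attempts(n: int) -> int:
    """Count the group splits a naive backtracking matcher tries for the
    pattern (a+)+$ on the input 'a' * n + '!'. Toy model, not a real engine."""
    text = "a" * n + "!"
    attempts = 0

    def match_groups(pos: int) -> bool:
        nonlocal attempts
        # Find how far the next `a+` group could greedily extend.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every possible length for this group, longest first.
        for stop in range(end, pos, -1):
            attempts += 1
            if stop == len(text):    # `$` matches only at the very end
                return True
            if match_groups(stop):   # otherwise try to start another group
                return True
        return False                 # nothing worked, so backtrack

    match_groups(0)
    return attempts


for n in (5, 10, 15):
    print(n, backtracking_attempts(n))
```

In this model each extra character doubles the work: 5 characters cost 31 attempts, 10 cost 1,023, and 15 cost 32,767. A few dozen characters is enough to keep a CPU busy far longer than any request will wait.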

The interesting part is what didn't catch it. WAF rules deployed globally, in one shot, with no staged rollout. That wasn't an oversight — it was a deliberate call to keep security rules consistent everywhere. On a normal Tuesday the tradeoff cost nothing. On this Tuesday the bad rule landed on every machine at once.
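A staged rollout can be sketched in a few lines. The version below is an illustration, not Cloudflare's deploy system: the stage names, the `deploy` callback, and the `is_healthy` check are all hypothetical placeholders.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy one stage at a time and stop at the first unhealthy stage.
    Returns the stages that actually received the change."""
    reached = []
    for stage in stages:
        deploy(stage)
        reached.append(stage)
        if not is_healthy(stage):
            break  # halt here: later stages never see the bad change
    return reached


deployed_to = []
reached = staged_rollout(
    ["canary", "one-region", "global"],  # hypothetical stage names
    deploy=deployed_to.append,
    is_healthy=lambda stage: False,      # assume the bad rule pins CPU wherever it lands
)
print(reached)  # ['canary']
```

The design choice is the `break`: a bad change still breaks something, but only the first stage pays for it.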

The WAF runtime had no CPU time limit on regex execution. Adding one had been on the table — it's a standard guardrail — and it hadn't been built yet. A regex that ran for 50 milliseconds and a regex that ran forever looked the same to the system.
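The guardrail described above can be sketched with a work budget. This is a toy under stated assumptions, not the WAF's real mechanism: it uses a hand-written backtracking matcher for the textbook pattern `(a+)+$`, and an attempt counter stands in for a CPU time limit. `BudgetExceeded` and `match_with_budget` are hypothetical names.

```python
class BudgetExceeded(Exception):
    """Raised when a match uses more attempts than it was allowed."""


def match_with_budget(text: str, max_attempts: int) -> bool:
    """Toy backtracking matcher for (a+)+$ that aborts once it has spent
    max_attempts. The attempt counter stands in for a CPU time limit."""
    attempts = 0

    def match_groups(pos: int) -> bool:
        nonlocal attempts
        # Find how far the next `a+` group could greedily extend.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every possible group length, longest first.
        for stop in range(end, pos, -1):
            attempts += 1
            if attempts > max_attempts:
                raise BudgetExceeded(f"gave up after {max_attempts} attempts")
            if stop == len(text) or match_groups(stop):
                return True
        return False

    return match_groups(0)


print(match_with_budget("aaaa", max_attempts=1_000))  # True
try:
    match_with_budget("a" * 30 + "!", max_attempts=10_000)
except BudgetExceeded as err:
    print("aborted:", err)
```

With the budget in place, the pathological input costs a bounded amount of work and produces an error the system can count and alert on, instead of a CPU that never comes back.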

And there wasn't a kill switch. Disabling the WAF required a config push, and that push went through the same edge now sitting at 100% CPU. Turning it off depended on the thing being on.

Any one of those three — staged rollout, regex timeout, independent kill switch — would have turned a global outage into a regional blip. None of them were forgotten.

The deploy was the trigger. Every other thing that should have made it small was a decision.


The anatomy of an outage

That pattern isn't Cloudflare's. It's every outage. Once you've sat through a few postmortems, you'll see the same three parts every time, in the same order.

1. The trigger

The trigger is what people point at in the first hour — the bad deploy, the partner API that returned malformed JSON on the third Friday of the month, the database migration script with the off-by-one, the rare input that nothing in staging ever produced. It's narrow, fixable in an afternoon, and almost never the interesting part of the story. A regex. A null check. A typo in a config file. The trigger is the sentence at the top of the incident summary, and it's the part of the outage you'll forget by next quarter.

2. The deferred safeguard

The interesting part is the thing that would've kept the trigger small. There's almost always one — sometimes three — and they share a profile. They were known. Someone had named them. They lived somewhere written down: a backlog ticket, a postmortem action item from the last incident, a Slack thread, a doc nobody reopened. And they'd been deferred at least twice, each time for a reason that didn't look like a reliability decision in the moment. Cloudflare's three were a staged-rollout system, a regex compute-time limit, and an independent kill switch. All three were known. None were built. That's the shape. The trigger surprises you. The safeguard doesn't — it's sitting in your own backlog.

3. Blast radius

Blast radius is how far the outage reached, and it's wider than the user-impact number that lands in the customer comms. It includes the engineering time the team burns on remediation and the postmortem week that follows. It includes the roadmap impact — the features that slip because the team is mid-cleanup instead of mid-build. And it includes trust: the all-hands question, the customer escalation thread, the exec who now wants weekly reliability updates. One customer, one feature, one region, global — these are different outages with different price tags, and "what's the blast radius?" is a real question with a calibrated answer.

EVERY POSTMORTEM IS A ROADMAP DOCUMENT

It tells you what the team has been deferring.


Reading an outage like a PM

The shape of a postmortem matters more than the trigger it points at. Read it for how it assigns cause, what it counts as a fix, and what happens to the fixes after the doc gets filed. That's where the roadmap is hiding.

Blameless postmortem

The label is misleading. The point isn't that nobody gets blamed — it's that blame doesn't surface what actually broke. A postmortem that lands on "engineer X pushed bad code" stops the analysis at engineer X. A postmortem that says "the deploy system shipped without a canary and without a regex compute-time limit" surfaces two safeguards you can fund.

The PM-visible read on a healthy postmortem culture is unglamorous: the action-item lists are longer. Blameless writeups cost more to produce because they keep asking past the obvious answer, and they produce more to ship because every contributing factor is a candidate fix. If your team's postmortems are short and tidy, that isn't a sign they're efficient. It's a sign they're stopping early.

Root cause vs. contributing factors

Real outages have a chain of conditions, not a single bad actor — and a postmortem that lands on one root cause is one that stopped asking too early. The 5-whys and fishbone exercises exist for that reason. They surface the contributing factors that single-cause framing misses.

Your read on this is fast. When you see "Root cause: X" with no contributing-factors list, the right question isn't "what's the fix for X." It's "what else was true at that moment?" Cloudflare's regex would've been written up as "Root cause: bad regex" in a single-cause culture. In a contributing-factors culture, it's three deferred safeguards in a list. Same incident, different roadmap implications — and the contributing-factors version is the one that changes what the team builds next.

Action items that survive

Action items are easy to write. The test isn't whether the postmortem produced them. The test is whether they shipped.

Pull the action items from the last three postmortems on a system and check how many made it to production. That ratio is the actual signal about whether reliability work is real or theatrical at your company. Theatrical action items are easy to spot once you look — they're vague ("improve monitoring"), unowned, or duplicated from a postmortem two quarters ago. The ones that survive are specific, owned, and dated. Grade the postmortem ninety days later, by the survival rate of what it promised.
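The survival-rate check is simple enough to script against an export from whatever tracker the team uses. The sketch below assumes a hypothetical `ActionItem` record with a title, an owner, and a shipped flag; the sample items are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]  # None means nobody owns it
    shipped: bool


def survival_rate(items: list[ActionItem]) -> float:
    """Fraction of action items that reached production (0.0 if none exist)."""
    if not items:
        return 0.0
    return sum(item.shipped for item in items) / len(items)


def unowned(items: list[ActionItem]) -> list[str]:
    """Titles of items with no owner, one marker of a theatrical action item."""
    return [item.title for item in items if item.owner is None]


# Invented sample data standing in for the last three postmortems.
items = [
    ActionItem("Add regex time limit to rule engine", "dana", True),
    ActionItem("Improve monitoring", None, False),
    ActionItem("Stage rule deploys by region", "lee", False),
    ActionItem("Document kill-switch procedure", "sam", True),
]
print(survival_rate(items))  # 0.5
print(unowned(items))        # ['Improve monitoring']
```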

PM Insight

Asking "show me last quarter's open postmortem action items" is the single highest-leverage move you have on reliability. You don't need to evaluate them. You just need them to exist and to be visible.


The error-budget bargain

Every team trades reliability for velocity. Most teams don't know they're trading — the swap happens in planning meetings, framed as scope and timing. The vocabulary in this section is what lets you see the trade while it's happening.

SLO vs. SLA

An SLO is a service-level objective — what the team promises itself. 99.9% availability, p95 latency under 300ms. Internal target, no contract. An SLA is a service-level agreement — what's promised to the customer, usually with money attached: credits, refunds, the enterprise renewal. Most teams have neither written down. They have an implicit SLO they discover the morning after an incident, when somebody says "this can't happen again." Your read is one question — which one is in play? If nobody can name either, you've found the chapter's argument in the wild.

Error budget

The error budget is the complement of the SLO. If you promise 99.9% availability, you're saying you accept up to 0.1% downtime — about 43 minutes per month, or 8.76 hours per year. That's the budget, and you're going to spend it.
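The arithmetic is worth doing once by hand. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is illustrative):

```python
def error_budget_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of allowed downtime in a period for a given availability SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


monthly = error_budget_minutes(99.9, 30)        # 30-day month (assumption)
yearly_hours = error_budget_minutes(99.9, 365) / 60

print(round(monthly, 1))       # 43.2
print(round(yearly_hours, 2))  # 8.76
```

Change 99.9 to 99.99 and the monthly budget drops to roughly four minutes, which is why each extra nine is a much bigger commitment than it looks.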

What it does is convert "ship the new feature or fix the flaky ingestion service?" from a recurring argument into a decision rule. While there's budget left, the team ships. When the budget is spent, the team stops and fixes — not an escalation, just the rule everyone agreed to.
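Written as code, the decision rule is almost embarrassingly small, which is the point. The sketch below assumes the simplest possible policy (any budget left means ship); real error-budget policies often add thresholds or exceptions, and the names here are hypothetical.

```python
def budget_decision(budget_minutes: float, incident_minutes: list[float]) -> tuple[float, str]:
    """Return (minutes of budget left, decision) for the current period."""
    remaining = budget_minutes - sum(incident_minutes)
    decision = "ship" if remaining > 0 else "stop and fix"
    return remaining, decision


# A 99.9% monthly SLO gives roughly 43.2 minutes of budget.
print(budget_decision(43.2, [10, 5])[1])   # ship
print(budget_decision(43.2, [27, 20])[1])  # stop and fix
```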

The part PMs miss: you already have an error budget, and you're already spending it. The only question is whether you're deciding what it gets spent on, or whether it gets spent on whatever pages on Tuesday. Error budget is the lever you didn't know you were pulling every quarter.

The unspoken bargain

The bargain is the trade itself: every minute of reliability work is a minute of feature work the team doesn't ship, and every shipped feature is a minute of reliability work the team didn't do. That trade is happening already. Without an SLO and a budget, it's happening in the background — decided by whoever has the loudest argument at planning, or by whichever fire is closest to the keyboard. With them, the trade becomes a roadmap question with a real answer. You get a seat at the reliability table you didn't realize you were eligible for.

How this changes by stage

Finding fit: You don't have SLOs and that's fine. Every outage is small and users tolerate them. The bargain still exists; it just isn't named yet.

Operating at scale: Error budget is the contract that decides which feature ships and which one waits. It is the most consequential PM lever in the org, and you should know what yours is.


On-call as a signal

The on-call channel is the cheapest leading indicator of reliability decay a PM has. It runs ambient, with no postmortem or incident required, and you can read it before the next big one lands.

Toil

Toil is the manual, repetitive operational work that scales with usage. The term comes from SRE, and it's PM-useful as a category. If toil grows faster than the team, you've got a problem the team is already paying for; they're just paying in engineering hours instead of budget. The PM-shaped read is one question: which on-call tasks have shown up twice in the last quarter? Anything that pages a human twice is a candidate to automate. The runbook for it (the doc the on-call engineer follows when the alert fires) is where toil gets prevented or repeated. The prioritization call lands on your desk, not theirs.
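The "paged twice" question can be answered from a plain list of alert names exported from the paging tool. A minimal sketch, with invented alert names and a hypothetical `repeat_pages` helper:

```python
from collections import Counter


def repeat_pages(alert_names: list[str], threshold: int = 2) -> dict[str, int]:
    """Alerts that paged a human at least `threshold` times, most frequent first."""
    counts = Counter(alert_names)
    return {name: n for name, n in counts.most_common() if n >= threshold}


# Invented sample: one entry per page in the last quarter.
quarter = ["disk-full", "cert-expiry", "disk-full", "queue-lag", "disk-full", "cert-expiry"]
print(repeat_pages(quarter))  # {'disk-full': 3, 'cert-expiry': 2}
```

Everything in the result is an automation candidate; `queue-lag` paged once and stays off the list for now.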

Pages per week

Trend matters more than the absolute number. A doubling quarter-over-quarter is a roadmap signal before it's an incident. Don't ask "how many pages did we get this week?" Ask "is it higher than last quarter, and which system is it from?" Pages cluster, and the cluster tells you where the next big incident is brewing. That's cheaper than waiting for it to land.
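The trend question can be sketched as a quarter-over-quarter comparison per system. The system names and counts below are invented, and `page_trend` is a hypothetical helper; the input is simply one system name per page.

```python
from collections import Counter


def page_trend(last_quarter: list[str], this_quarter: list[str]) -> list[tuple[str, int, int]]:
    """Return (system, pages last quarter, pages this quarter),
    sorted so the biggest increase comes first."""
    before = Counter(last_quarter)
    now = Counter(this_quarter)
    systems = set(before) | set(now)
    rows = [(name, before[name], now[name]) for name in systems]
    # Sort by increase (descending), then by name to keep ties stable.
    return sorted(rows, key=lambda row: (row[1] - row[2], row[0]))


last = ["billing"] * 4 + ["search"] * 5
this = ["billing"] * 9 + ["search"] * 5 + ["ingest"] * 2

for system, was, is_now in page_trend(last, this):
    print(f"{system}: {was} -> {is_now}")
```

In this sample, billing more than doubled while search held flat, so billing is where to look first, even though search had more pages last quarter.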

The orphan system

The orphan is the codebase no current team member built: the service that survived two reorgs, the integration whose original author left two years ago. Orphan systems page more, fix slower, and get deferred faster than anything else on the board. Nobody owns them with the conviction of the person who wrote them. AI-assisted code archaeology helps with the discovery, but the underlying question (rebuild, replace, or keep paying the tax) is still PM-shaped, and that call is yours.

When a system breaks, you're not learning what's wrong with the system. You're learning what your team has been deferring.


PM Playbook — Questions to ask

The next time a real incident lands on your team, try these:

