Chapter 09

Can We Ship This Safely?

Safe shipping is detection speed plus rollback speed — feature flags, canary rollouts, and blue/green deployments through the blast window lens.

⏱ 16 min read 🧭 Decision

The 45-minute company

August 1, 2012. The NYSE opens at 9:30 AM ET on the day the SEC's new Retail Liquidity Program goes live, and Knight Capital — one of the largest market-makers in US equities — has new code in production to handle it. Within minutes, something is very wrong. Orders are flooding out in volumes nobody asked for, in stocks nobody meant to trade, at prices that make no economic sense. By 10:15 AM, Knight has bought roughly seven billion dollars of stock across 154 NYSE-listed names. The firm's pre-incident market cap was around four hundred million.

The deploy had gone out to eight production servers. It landed on seven. The eighth still ran the old binary — and a configuration flag that the new code used for one purpose was, on that old binary, the trigger for a deprecated routing path called Power Peg that nobody had run in years. When the market opened, the eighth server interpreted "go" as "fire the dead code," and the dead code started routing orders as fast as the wire allowed.

Nothing stopped it for forty-five minutes. There was no staged rollout that would have caught one server out of eight behaving differently from the other seven. There was no automated rollback wired to the kinds of metrics that would have screamed inside the first minute — order volume, P&L, inventory drift. The kill switch existed as a feature flag, and toggling it did stop the seven good servers — but the rogue server interpreted the same flag the old way, and you can't turn off code that doesn't know the switch is for it. By the time humans understood what they were looking at, Knight had lost about four hundred and sixty million dollars and was, in any meaningful sense, no longer a company. Getco acquired the husk in December.

Every safeguard that wasn't there would have collapsed the blast window from 45 minutes to under one. The rest of this chapter is about which safeguards move which number.

The blast window

Every outage you'll ever ship breaks down into the same two intervals — how long it takes to notice, and how long it takes to stop. Name them now; the rest of the chapter is about which engineering choices move which number.

1. Time to detect (TTD)

TTD is the clock from the moment a regression reaches production to the moment your team knows about it. It's a function of monitoring sensitivity, human attention, and how much traffic the bad code is seeing — not how smart anyone on the team is. A noisy regression on a top-of-funnel surface — checkout button stops working, conversion drops 40% — gets a 2-minute TTD because the dashboards scream. A subtle pricing bug on a niche surface — wrong tax math for one country, one product line — can sit undetected for 6 hours because nothing crosses an alert threshold. The factors that move TTD are concrete and budgetable — alert thresholds, metric coverage, canary exposure, and who's actually watching the dashboard during the rollout window.

2. Time to rollback (TTR)

TTR is the clock from "we know" to "the regression no longer reaches users." It's a function of rollback method, not team heroism. The spectrum is wide — a feature flag flip lands in seconds, a blue-green switch in roughly 30 seconds, a config revert in a minute, a version rollback in 5 to 30 minutes, a hotfix deploy in 30 to 60, and a database migration unwind in hours if it's possible at all. The PM-shaped question for any launch is which method is actually on the table for this change. "We can hotfix it" is a 30-minute number, not a vibe — and a launch with only hotfix on its rollback path has a wider blast window than the team is telling you.

3. The blast window

The blast window is TTD + TTR — the duration during which the regression is actively reaching users. The companion number is the blast radius: how many users are affected during that window. "What's our blast window?" is a quantitative question with an answer that fits in one cell of a Slack message — minutes. "What's the blast radius?" follows immediately, and it's also a number. Knight Capital's TTD was effectively their TTR — by the time they understood what was happening, the damage was done. Blast window: 45 minutes. Blast radius: every share they traded that morning.

BLAST WINDOW = TTD + TTR

Every safeguard moves one of those two numbers.

The rollout instrument cabinet

The blast window is a number you can shrink. Four techniques compress it, each acting on a different part of the window — some on TTD, some on TTR. The rest of this section is the cabinet — what each tool moves, and what it doesn't.

Feature flags

A feature flag is a runtime switch decoupling deploy from release. Code ships dark, the flag flips it live, and the flip's reversible in seconds. That's a TTR collapse — "redeploy the last good binary" down to "set a boolean." It doesn't lower TTD; a flag-gated rollout nobody's watching breaks as quietly as an ungated one. The PM-visible question on every flag is when does it come off? A flag with no cleanup date becomes a dark feature flag — silent branching logic the team trips over a year later, tech debt the PM helped create. Chapter 13 revisits this.

Canary rollouts

A canary releases new code to a small cohort first — 1 to 5% of traffic — before the rest of the fleet sees it. The number this moves is TTD. The canary cohort hits the regression first, and attention runs higher during the canary window, so a bad change surfaces against a smaller blast radius and a louder dashboard. It doesn't lower TTR. The trade is time for compression — a 24-hour canary delays rollout by 24 hours and buys a typical several-fold cut to detection time, depending on cohort size and dashboard discipline, on regressions that'd otherwise ship silently to 100%.

Blue/green deployments

Blue/green runs two identical production environments. One serves live traffic, the other holds the new release, and the cutover is a router flip. For code changes, that flip collapses TTR to near-zero — the old environment's still running, healthy, and rollback is the same flip in reverse. It doesn't help with data-state changes; once a migration has rewritten rows in the shared database, flipping the router back doesn't unwind them. Migration unwinding is a separate problem, out of scope. Blue/green earns its keep at scale.

Rollback automation

Rollback automation is the technique whose TTR scales directly with investment. A documented runbook is the manual end — paged engineer, checklist, ~30 minutes. A CI-driven revert plus redeploy lands at ~10 minutes. A flag flip on a pre-instrumented surface clears in a second. The PM-shaped read isn't "more automation is better." It's which level does this surface warrant, given its TTD? A 30-second TTD doesn't need a 30-minute rollback — that blast window is TTR-dominated, and the investment goes there.

PM Insight

Asking "what's the blast window if this goes wrong?" before tagging is the single most useful PM move on shipping safety. The team's answer tells you whether the plan exists.

The four techniques compose. To see how they trade off, try the combinations below — pick a stage and scenario, then tweak the parameters and watch the blast window recompute.

Try it: rollout strategy → blast window

Pick a scenario, then tweak the inputs to see how each safeguard moves the blast window.

Tune the assumptions

Daily active users

Rollout strategy

Canary cohort %

Canary duration (hours)

Monitoring sensitivity

Rollback method

TTD

—

TTR

—

Blast window

—

Blast radius

—

balanced

Designing the rollout plan

The blast window is a target. The four techniques compress it. The PM's job is the conversation that names the target and picks the techniques — before the deploy gets tagged.

The acceptable blast window

Before picking safeguards, name the target. "Our blast window for this surface should be under five minutes" — a number the team can engineer toward, not a wish. A B2B enterprise auth flow lives or dies on ten seconds; longer than that and a customer's workforce is locked out. An internal admin tool can take fifteen minutes without anyone outside the building noticing. A consumer flagship feature lands near one minute, with monitoring tightened on the canary cohort.

Matching technique to feature

Each safeguard has a fit. A pure code change with a stable contract runs cleanly on blue/green — the router flip is the rollback. A behavior change with reversible side effects — a new ranking, a new pricing rule — wants a feature flag. A risky algorithmic change with measurable side effects — a new recommender, a new fraud model — wants canary plus flag. Database migrations aren't in scope here; schema-change rollouts have their own playbook — expand/contract, dual-writes, zero-downtime migrations. AI-assisted observability — anomaly detection across log streams, LLM-summarized error clusters — is starting to compress TTD measurably. The PM-shaped question of what blast window's acceptable for this feature hasn't changed; AI just shrinks one of the two numbers in the equation.

What you sign off on

The PM signs off on the plan, not the implementation. The plan names the target blast window, the rollout strategy, the canary percentage and duration if one applies, the rollback method, the monitoring criteria, and the decision points — what triggers escalation, and who's paged when it fires. In practice, that's a one-paragraph rollout plan posted in Slack before the release gets tagged. Visible. Auditable. It forces the named conversation and gives the on-call engineer at 2 AM a document to act from instead of a guess.

How this changes by stage

At finding fit: A feature flag is the only safeguard you need. Roll back with git revert. Building canary infrastructure before product-market fit is yak-shaving the safeguards that aren't yet load-bearing.

At operating at scale: Every launch has all four safeguards layered. Rollback automation is governed by change-management policy. The blast window is computed in advance against an SLA; "we'll see how it goes" isn't an answer.

When rollback is the answer

Rolling back beats pushing through more often than teams admit. The discipline is harder than it sounds — engineers in the middle of an incident want to fix what they can see, not undo what they already shipped. The moment to name that trade-off is the PM's.

The fix-forward trap

Ninety minutes into an incident, the temptation is loud: "I see the bug, let me push a fix." That's sunk-cost reasoning dressed up as urgency. Rolling back to a known-good state and debugging in calm is cheaper than continuing to debug in production while users feel the pain. The engineer deep in the stack trace won't be the one to call it — they're too close to the fix that's almost working. The PM-shaped move is to be the one who asks "should we roll back now?" before another forty minutes burn into a hotfix.

Rolling back gracefully

Rollback isn't free, even when it's fast. Data written during the blast window may need cleanup. Sessions started in the broken state may need invalidation. Outbound webhooks have already fired downstream, and you can't unsend them. None of that means rollback is the wrong call — it means the rollback has its own runbook, and that runbook needs to exist before the deploy ships. Plan the undo as carefully as the do.

Preserving state

The smallest unit you can roll back determines the blast radius of the rollback itself. Flag-level granularity is finest — one feature off, the rest of the release stays live. Commit-level rewinds one change. Deploy-level rewinds everything that shipped together, including the four things that were fine. The PM-visible question, before any rollout begins: what's the smallest unit you'd roll back — the flag, the commit, or the deploy?

Detection time plus rollback time equals what each launch costs you when it's wrong. The rest is just engineering.

PM Playbook — Questions to ask

The next time a rollout plan lands on your desk, try these:

"What's the blast window if this goes wrong?" — Forces the team to compute TTD + TTR before ship, not after the incident.
"What's our TTD on this specific surface?" — Forces a number, not a vibe. If nobody knows, monitoring hasn't been calibrated for this feature.
"What's the rollback method, and how long does it actually take?" — Names the TTR. "We can hotfix" isn't a TTR; "the deploy pipeline takes 22 minutes from merge to prod" is.
"What rollout strategy did we pick, and why?" — Forces articulation of the choice (full / canary / blue-green / flag-with-ramp) rather than defaulting.
"Is the rollback path on a different system than the feature?" — Knight Capital's lesson encoded. If rollback depends on the broken system, the rollback isn't real.
"When does the flag come off?" — Flags left forever become dark feature flags — accumulated tech debt the PM helped create.
"What's the smallest unit we'd roll back — the flag, the commit, or the deploy?" — Tests the granularity of the rollback plan; finer granularity = lower blast radius.

Check your understanding 4 questions