Chapter 08

Is This Ready to Ship?

"Good enough" is a calibration the PM owns — stage, audience, and reversibility decide the bar your team should ship against.


The wrong conversation

October 1, 2013. HealthCare.gov goes live as the centerpiece of ACA enrollment: the federal exchange the administration had been promising for three years, anchored to a date written into law. Within hours, the public-facing site is unusable. Error pages, infinite spinners, timeouts. The few users who do get through can't finish account creation. By the end of the first day, six people have completed enrollment on a system built to serve millions.

The team didn't ship something they believed was broken. They'd cleared their internal QA bar. Contractors had walked the happy paths. The demo worked. The whole readiness apparatus had been calibrated against a load that looked like dozens of concurrent users — engineers and reviewers clicking through flows in a staging environment. It wasn't malice or negligence. It was the wrong target.

What the launch required was a different bar entirely. Day-one traffic ran somewhere between two and eight million unique visitors. The audience wasn't internal testers; it was an entire country watching a signature policy land in real time, with brand and political stakes attached to every error page. And the date itself was law — once October 1 passed, there was no rolling back to a quieter beta. Reversibility was zero.

What HealthCare.gov shipped wasn't broken in the engineering sense. It met its bar. The bar was wrong — calibrated for the wrong stage, the wrong audience, and the wrong reversibility. The readiness review answered the question the team was asking. Nobody in the room was asking the question the launch was about to ask them.

"Ready" isn't a yes/no engineering report. It's a calibrated bar — and the calibration is the PM's question, not the team's.

The three dimensions of "good enough"

Every ship-readiness conversation collapses to the same three calibration dimensions: stage, audience, and reversibility. Name them once, and the rest of this chapter is just applying them.

1. Stage

Stage is where the company sits on the maturity curve, and it's the bar your everyday shipping calibrates against. The recurring frame has three modes — finding fit, scaling, operating — and each tolerates a different kind of rough edge. A finding-fit team ships to learn; an operating team ships without breaking the install base.

Launch is its own thing — not finding fit, not scaling, not operating: a one-time bar that overrides everyday calibration. A finding-fit startup launching its first paid product carries launch-mode stakes on top of finding-fit calibration. The PM-shaped read is which mode you're in: recurring (three-stage frame) or one-time (launch bar).

2. Audience

Audience is who's affected when you're wrong, and the bar moves with the blast radius. The tiers are recognizable: an internal admin tool used by twelve engineers; a B2B SMB product with roughly a thousand users and low contractual stakes; a B2B enterprise product with a hundred accounts and SLAs attached; a consumer product where trust is at stake, with millions of users and brand on the line.

The same code shipped to those four audiences needs four different bars. The team's default audience is themselves, because that's who they tested against. Your job is to name the actual audience before the readiness review calibrates to the wrong one.

3. Reversibility

Reversibility is how cheaply you can walk a change back, and it sets the price of being wrong. The spectrum runs from a feature flag in seconds, to a config flip in minutes, to a hotfix deploy in hours, to a version rollback over a day, to a migration unwound across weeks, to a public commitment you can't really take back.

Cheap reversibility lets you ship at a lower bar — you'll fix it in an afternoon. Expensive reversibility raises the bar, because "fix it later" isn't available. Most teams treat reversibility as a given instead of a dial, and that's the dimension the PM most often has to put on the table.
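One way to make the dial visible is to write the spectrum down as an ordered scale. The sketch below is illustrative only: the `Reversibility` enum mirrors the tiers listed above, while the `needs_higher_bar` helper and its default threshold are hypothetical choices a team would set for itself, not rules from this chapter.

```python
from enum import IntEnum


class Reversibility(IntEnum):
    """Rollback mechanisms from the spectrum above, cheapest first."""
    FEATURE_FLAG = 1       # seconds
    CONFIG_FLIP = 2        # minutes
    HOTFIX_DEPLOY = 3      # hours
    VERSION_ROLLBACK = 4   # about a day
    MIGRATION_UNWIND = 5   # weeks
    PUBLIC_COMMITMENT = 6  # can't really be taken back


def needs_higher_bar(
    change: Reversibility,
    threshold: Reversibility = Reversibility.HOTFIX_DEPLOY,
) -> bool:
    """Return True when walking the change back costs more than the threshold.

    The default threshold is a hypothetical team choice, not a standard.
    """
    return change > threshold


if __name__ == "__main__":
    for change in Reversibility:
        print(f"{change.name:<18} higher bar: {needs_higher_bar(change)}")
```

The point of the ordering is that the team has to say which tier a change sits in before anyone says "we can fix it later."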

HealthCare.gov's three mismatches were a launch (not a feature ship), millions of public visitors (not internal testers), and zero reversibility (not "we can hotfix"): three dimensions, all miscalibrated at once.

READY IS A VERB, NOT A STATE

You're never ready in the abstract — you're ready against a calibrated bar.

Reading the team's testing as signal

By the time you ask "is this ready?" the team has already chosen a bar. The bar is visible in three places — the rest of this section shows you where to look.

Acceptance criteria

Acceptance criteria are what "passing" actually means, and they read differently depending on who wrote them and when. AC written by the engineer who built the feature pass at a different bar than AC written by the PM/eng pair before the feature was built. The first is "does it do what I built." The second is "does it do what was asked." The gap between them is where shipped features quietly miss the brief — that's where calibration leaks out.

The PM-visible move is to ask where the AC live (Linear, Jira, a doc, nowhere) and who wrote them. The where-and-who tells you the bar. AC scribbled into the PR description that morning aren't the same bar as AC written into the story two weeks before code started.

Known issues

The known-issues list is the set of things the team already knows aren't right but is shipping anyway. The list itself is the signal. A long list means the team has been silently widening "what we accept" without naming the new bar. A short list means either real rigor or hiding, and length alone won't tell you which.

Either way, you should see it. "How many known issues are open on this release?" is the single question that surfaces what the team's been deciding without you. If the answer is "a few" and nobody can name them, the bar is wherever the engineer's tolerance sits this week — a mood, not a calibration.

The "we'll fix it in v1.1" tax

Every deferred fix is a future task that competes with new feature work. The carry-cost is real, but the truer signal is the pattern of how often "v1.1 fixes" actually ship. Most teams treat the v1.1 promise as release-day comfort — it isn't a forecast.

If the last three "v1.1 fix" tickets are still open in the backlog four months later, the v1.1 tax isn't a fix-cost — it's a permanent reliability concession the team's been writing into every release. Don't grade the postmortem after the v1.1 ships. Grade the survival rate of past v1.1 commitments.
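The survival rate is simple arithmetic once the past promises are written down. The sketch below assumes a hypothetical `DeferredFix` record and made-up ticket ids; in practice the history would come from whatever tracker the team uses.

```python
from dataclasses import dataclass


@dataclass
class DeferredFix:
    ticket: str    # hypothetical ticket id
    shipped: bool  # True if the deferred fix actually shipped


def shipped_rate(fixes: list[DeferredFix]) -> float:
    """Fraction of past deferred fixes that actually shipped (0.0 if none recorded)."""
    if not fixes:
        return 0.0
    return sum(fix.shipped for fix in fixes) / len(fixes)


# Made-up history for illustration: one of four "v1.1" promises shipped.
history = [
    DeferredFix("FIX-101", shipped=True),
    DeferredFix("FIX-102", shipped=False),
    DeferredFix("FIX-103", shipped=False),
    DeferredFix("FIX-104", shipped=False),
]

if __name__ == "__main__":
    print(f"{shipped_rate(history):.0%} of past v1.1 promises shipped")
```

With this made-up history the rate is 25%, which is the number to bring to the next "we'll fix it in v1.1" conversation.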

PM Insight

If you don't know what acceptance criteria the team is testing against, you've already shipped against the engineering default — and that default isn't your call.

The pre-ship conversation

The move that separates calibrated shipping from vibe-shipping is a five-minute conversation that names the bar before the release tag goes on. It's not a QA review — it's a calibration check. Three questions are the whole conversation, and you can run them in next week's pre-release meeting.

Calibrated against what?

Force the team to name the bar in plain language. The sentence you're listening for is "we've tested this against [scenario X, audience Y, recovery cost Z]." If nobody can say it that way, the bar exists implicitly — and an implicit bar is the engineering default from §3, not a calibration you signed off on.

You: "What did we test this against?"
Eng: "All the tests pass."
You: "Tests against what — the demo, the design partner, the enterprise rollout?"
Eng: "…"

That pause is where the bar was hiding. The named pause is the move; you don't need to fill it.

The launch list

Between "code complete" and "release tagged" sits a 24-hour checklist — smoke tests, monitor checks, on-call coverage, a rollback dry-run, customer comms staged. The list is finite, small, and writeable on one page. The team without one ships on vibes, and "vibes" isn't a bar.

Ask to see the launch list a week before tagging, not the morning of. A week out, the gaps are still cheap to close. The morning of, you're either shipping with the gaps or slipping the date — both worse than the conversation you could've had on Tuesday.
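A launch list this small can live anywhere, including a few lines of script. The sketch below uses the five items named above; the `open_gaps` helper and the sample status are hypothetical, shown only to illustrate what "seeing the gaps a week out" might look like.

```python
# The five checklist items named in the text, in order.
LAUNCH_LIST = [
    "smoke tests",
    "monitor checks",
    "on-call coverage",
    "rollback dry-run",
    "customer comms staged",
]


def open_gaps(done: set[str]) -> list[str]:
    """Return launch-list items not yet done, in checklist order."""
    return [item for item in LAUNCH_LIST if item not in done]


# Hypothetical status a week before tagging.
done_so_far = {"smoke tests", "monitor checks"}

if __name__ == "__main__":
    for gap in open_gaps(done_so_far):
        print(f"OPEN: {gap}")
```

The format matters less than the fact that the list exists and that someone other than its author has read it.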

What we're accepting

The deliberate accept-and-ship decisions, named out loud. "We're shipping with bug X open because [why]." Once named, the trade is visible; until named, it's drift. Engineering will accept things silently — your job is to make the acceptance explicit.

"We're shipping with the slow first-load on Safari open because two enterprise prospects need this Friday and Safari is 6% of our user base." That sentence, spoken out loud in the pre-ship meeting, is this section's whole move. It turns a silent trade into a recorded one — and a recorded trade is something the team can revisit instead of inherit.

How this changes by stage

At finding fit: "Good enough" is whatever lets you learn fastest. The launch list is one engineer's vibe and that's appropriate — the cost of getting it wrong is one customer's confusion, easily recovered.

At operating at scale: "Good enough" is the SLA the customer signed. The launch list is governed by change-management policy; "what we're accepting" goes to a change advisory board (CAB) review.

When the bar slips silently

Bars drift quietly. Nobody votes to lower one — it sits a little lower next release than it did the last. Catching the drift before it lands as an incident is this chapter's last move.

Bar drift

The pattern is recognizable once you've lived it. Each release's known-issues list runs a little longer. Each pre-ship conversation runs a little shorter. Each "v1.1 fix" is a little more likely to slip to v2, then quietly to never. By the time anyone notices, the bar is two notches below where it started, and nobody can name when it changed — because no single step felt like one.

The team that used to insist on a dry-run rollback before tagging now tags first and dry-runs after, if there's time. The on-call who used to read the launch list line by line now skims it. Each was a small accommodation, and the accommodations compounded.
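One of these signals, the lengthening known-issues list, is easy to check mechanically. The sketch below is a rough heuristic under stated assumptions: the `is_drifting` function, its three-release window, and the sample counts are all hypothetical, and a rising count is a prompt for a conversation rather than proof of drift.

```python
def is_drifting(known_issue_counts: list[int], window: int = 3) -> bool:
    """Return True if the known-issues count rose at every one of the last
    `window` release-to-release steps. Counts are ordered oldest to newest."""
    recent = known_issue_counts[-(window + 1):]
    if len(recent) < window + 1:
        return False  # not enough releases to compare
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical open known issues at each of the last four releases.
    print(is_drifting([4, 5, 7, 9]))
    print(is_drifting([4, 5, 5, 9]))
```

The first call reports drift because every release carried more known issues than the one before; the second does not because the count held flat once.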

The "we always do it this way" trap

When the bar gets challenged, the defense the team reaches for is the one that sounds least like a calibration: "This is how we ship here." That sentence is convention dressed as a standard, and it's the most common way a slipped bar survives a review.

The PM-shaped move is separating the two. Convention is "we ship this way because we always have." Calibration is "we ship this way because the audience and reversibility warrant it." When the answer's "always have," the bar is unanchored.

Re-anchoring

The corrective move is small. Once a quarter, or after any meaningful escalation, the team revisits the bar in plain language. Not a process ritual — a thirty-minute conversation that names what's been silently accepted and decides, out loud, whether to keep accepting it.

Re-anchoring produces a one-page doc the team can compare against next time. The output isn't a process artifact; it's a memory aid. Without it, every release recalibrates from the team's current mood instead of from the bar they set six months ago.

Bars drift quietly. Re-anchoring them is loud, deliberate, and the PM's job.

What changes when AI is in the loop

The question: What does AI change about "good enough"?

What's changed. Three things are concretely different:

The cost of "more thorough" has collapsed. AI-generated test scaffolds, fuzz inputs, property-based tests, and CI-driven regression detection have made it cheap to ship with high test coverage on surfaces where high coverage used to cost weeks. The team can be more thorough without spending more time — if they choose to.
AI-generated code has different testability properties. Code that's structurally novel but statistically common (the kind LLMs produce well) can pass unit tests while behaving subtly wrong in integration. The thing being tested isn't always what's being shipped.
The shipping rhythm itself has accelerated. AI-assisted PR review and ambient test generation in IDEs mean fewer bars are checked once-per-release; many are checked once-per-commit. The "are we ready?" conversation moves from a release moment to a continuous calibration.

What hasn't changed. The chapter's whole argument: the calibration question is still PM-shaped. AI can help the team meet a bar; it cannot pick the bar. Audience, reversibility, and the stakes of getting it wrong are still product judgments. AI doesn't have an opinion on which customer trust your team can afford to spend.

A second thing that hasn't changed: the cost of getting calibration wrong. AI lowers the friction of "more thorough," but it also lowers the friction of shipping faster — which means a miscalibration lands sooner. The bar that's wrong with AI in the loop is wrong faster.

What to watch for.

"AI generated and tested it" without naming the spec. The team that says this has handed both the bar and the verification to the AI. Ask: against what acceptance criteria did the AI test, and who reviewed them?
High auto-generated coverage as proof of thoroughness. 95% coverage from auto-generated tests can mean less than 70% coverage from tests a human pair wrote against acceptance criteria. The coverage number is the easy metric; what is covered is the question.
The over-investment trap. Because AI made thoroughness cheap, teams sometimes over-test surfaces that don't need it, giving every internal admin tool the rigor of the consumer flagship. Calibration goes both ways; "more is always better" is not the chapter's argument.
Continuous calibration without explicit re-anchoring. When the bar is checked at every commit, it's tempting to skip the once-a-quarter re-anchor conversation in §5. Don't. The drift pattern is harder to spot when the bar is implicit in CI rules nobody re-reads.

PM Playbook — Questions to ask

The next time a ship-readiness conversation lands on your desk, try these:

"What audience are we calibrating against?" — The team's default is "us internally"; the actual audience usually requires a higher bar.
"What's the rollback if this is wrong?" — If no clean rollback, the bar must be higher. If reversible cheaply, you can ship at a lower bar and learn.
"What acceptance criteria are we testing against, and who wrote them?" — If the answer is "the eng team decided," calibration has happened without you.
"What's on the known-issues list, and which of those are we choosing to ship despite?" — Separates deliberate accept-and-ship from un-surfaced issues.
"What's the launch list — what specifically gets done in the 24 hours before tagging?" — Tests whether there's a checklist or just a vibe.
"Compared to our last few launches, are we shipping at a higher or lower bar — and is that deliberate?" — The only way to catch silent drift is to actively compare.
"What would have to be true for us to NOT ship on the planned date?" — Surfaces the implicit "date fixed, scope variable" vs. "scope fixed, date variable" tradeoff.
