Chapter 04

A/B Testing & Experimentation

How experiments actually work, how to size them correctly, and the mistakes that silently invalidate results.

⏱ 20 min read 📊 Includes 2 interactives

Why we experiment at all

You've redesigned the checkout flow. Your designer is confident it's better. Your engineer is confident too. Usability research identified real friction on the payment form, and the new design fixes it. You formed a prior in Chapter 3: ~4.6% conversion, up from the current 4.2%. So why not just ship it?

Because everyone is always confident. And they're often wrong. Intuition about how users will behave is systematically biased by what we know, what we built, and what we want to be true. Experiments let reality vote.

An A/B test splits your users into two groups — a control (the current experience) and a treatment (the new one). Because the split is random, any difference in outcomes you observe is likely caused by the change, not by pre-existing differences between the groups.

PM Insight

The power of an A/B test isn't the statistics — it's the randomization. Random assignment is what lets you claim causation instead of just correlation. Without it, you have a before/after comparison, which is much weaker.

An experiment is the structured way to run the belief-update loop from Chapter 3. You start with a prior (the hypothesis), define what evidence would change your decision (the MDE), collect data under controlled conditions, and update. The structure prevents the two failure modes Bayesian reasoning guards against: overreacting to early noise, and rationalizing results after the fact.


The anatomy of an experiment

1. Hypothesis

Before you run anything, write down what you believe will happen, why, and by how much. Direction alone isn't enough — you need a quantified prediction.

WEAK
"We believe the new checkout design will improve conversion."

BETTER
"We believe the redesigned checkout will increase conversion by ~10% relative (4.2% → ~4.6%), because usability research shows users abandoning at the payment form step, which this design directly addresses."

The quantified estimate isn't a guess — it's the prior you built in Chapter 3: the last checkout redesign lifted conversion by 8%, benchmarks suggest 5–15% is achievable, and research pointed to a specific, fixable friction point. Combine those signals and 10% is a defensible estimate — not optimism, not false modesty.

This matters because your hypothesis feeds directly into your MDE (below). If you expect a lift of around 10%, an MDE of 5% is sensible. If your hypothesis were only "I think it'll go up a bit," you'd have no basis for setting an MDE at all, and you'd likely run an experiment too small to detect anything real, or too large to be worth the time.

PM Insight

Your hypothesis and your MDE need to be consistent. If you pre-commit to an MDE of 10% but your prior is only 3%, something is off — either you're understating your expected effect, or the experiment isn't worth running. Writing down a quantified hypothesis forces that inconsistency into the open before you spend weeks running a test.

2. Primary metric

One metric you're trying to move. Not three. Not five. One. Adding more metrics increases the chance you'll find something significant by accident and declare victory on a measurement that doesn't matter.

For the checkout redesign: checkout conversion rate. That's it. Everything else is secondary.

3. Guardrail metrics

Metrics you're not trying to move, but would care deeply about if they did. These are your safety net — a win on one metric that breaks another isn't a win.

For the checkout redesign: revenue per session (conversion could go up while order value drops), page load time (a slower checkout might convert less at scale), and payment error rate (a new form could introduce new failure modes).

4. Minimum detectable effect (MDE)

The smallest improvement that would actually matter for your business — set from business logic, not statistical convenience. If a 2% relative lift in checkout conversion wouldn't change your roadmap, don't power the experiment to detect it.

For the checkout redesign: you expect ~10% relative lift and you pre-committed in Chapter 3 to treating anything under 5% as not worth shipping. So your MDE is 5%. That tells you exactly how large a sample you need.

PM Insight

MDE is a business decision, not a statistics question. Ask: "If the real effect is X, would we ship this?" Work backward from there. Setting a realistic MDE will save you from running experiments that are either impossibly long or detect effects too small to act on.


Sample size: the question everyone skips

Most experiment mistakes happen before the experiment starts. Teams launch tests without checking how many users they actually need — then stop when the dashboard shows significance, or when they get impatient, whichever comes first.

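Sample size depends on three things: your baseline conversion rate, the minimum detectable effect, and the confidence and power you require.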

The lower the baseline rate and the smaller the effect you're trying to detect, the more users you need. Often far more than teams expect.

For the checkout redesign: baseline 4.2%, MDE 5% relative lift, and say 3,000 users hit checkout daily. Try those numbers in the calculator — you'll see why teams that launch on gut feel and check after a week almost always stop too early.

Interactive — Sample Size Calculator

Set your baseline rate, the minimum lift you care about, and your daily traffic. See how long your experiment needs to run. (Assumes 95% confidence, 80% power.)


The formula behind the calculator

The interactive solves the standard two-proportion sample size formula. For each variant you need approximately:

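$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \,\left[\,p_1(1 - p_1) + p_2(1 - p_2)\,\right]}{(p_2 - p_1)^2}$$

Here $p_1$ is the baseline rate, $p_2$ is the rate you would see at the MDE, $z_{\alpha/2} \approx 1.96$ for 95% confidence (two-sided), and $z_{\beta} \approx 0.84$ for 80% power.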

Why the calculator behaves the way it does

The denominator is squared, so n grows with $1 / (\text{effect size})^2$. Halving the MDE you want to detect roughly quadruples the sample needed. This is why "let's also catch a 1% lift" can turn a two-week test into a six-month one, and why anchoring MDE in business value (rather than statistical convenience) matters so much. It's also why low-baseline metrics need so much more traffic: a +5% relative lift on a 1% conversion rate is a much smaller absolute gap than the same lift on a 30% rate, so the $(p_2 - p_1)^2$ term shrinks and n explodes.
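To make the formula concrete, here is a minimal Python sketch that plugs in the checkout numbers (baseline 4.2%, 5% relative MDE, 3,000 users a day). The function names are illustrative, and the interactive calculator may round or approximate differently, so treat the output as a ballpark rather than a match for the widget.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion test)."""
    p1 = baseline                        # control conversion rate
    p2 = baseline * (1 + relative_mde)   # treatment rate if the lift equals the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


def days_to_run(n_per_variant, daily_users, n_variants=2):
    """Calendar days needed if all daily traffic is split across the variants."""
    return ceil(n_per_variant * n_variants / daily_users)


n_5pct = sample_size_per_variant(baseline=0.042, relative_mde=0.05)
n_2_5pct = sample_size_per_variant(baseline=0.042, relative_mde=0.025)

print(n_5pct, days_to_run(n_5pct, daily_users=3000))
print(round(n_2_5pct / n_5pct, 2))  # ratio when the MDE is halved
```

Under these assumptions the formula asks for roughly 147,000 users per variant, which is about 98 days at 3,000 users a day. Halving the MDE to 2.5% multiplies the requirement by close to four.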


The peeking problem

You launch the checkout redesign test. After three days you check the dashboard. The new flow is winning — p = 0.04, just under 0.05. You call it. Ship it.

This is one of the most common and costly mistakes in product experimentation. It's called peeking, and it silently inflates your false positive rate.

The intuition — two horses on a track

Picture two horses racing — equally fast, same training. At any moment one is a few feet ahead because of random stride variations. Call the winner only at the finish line and you'll correctly conclude they're tied. But stand by the rail and yell "A is winning!" the moment A pulls ahead by a step, and you'll declare A the winner on most races — you're not measuring speed, just catching whichever horse is on a good stride when you happened to look.

An A/B test wobbles the same way. Even when A and B are identical, the running tally drifts up and down. At some moment, one variant will be "significantly" ahead by chance. p < 0.05 only does what it says on the tin if you commit to checking once, at a pre-decided sample size. Check every day and stop the first time it dips under 0.05, and you've effectively run a dozen tests and picked the most flattering one.

What it actually costs

At 95% confidence you've agreed to a 5% false positive rate per test. Checking every day for two weeks can push the real false positive rate to roughly 4× what you signed up for, often 20% or more. In other words, when there is no real effect at all, you would still call a "win" about one time in five. Continuous real-time monitoring is worse.

The simulation below makes this concrete. It runs 10 A/A tests (no real difference between A and B) and shows how often random noise alone crosses the significance threshold when peeking is allowed. At 95% confidence you'd expect about 5% of runs to cross by chance. Watch what actually happens.

Interactive — The Peeking Problem

Each click runs 10 A/A experiments (no real difference between A and B). Red lines crossed the significance threshold somewhere during the test. Use the slider below the chart to drop an "I'll check at exactly this n" line; the result panel updates with the false-positive rate at that one committed checkpoint. Compare it to the peeking rate (stop the moment p < 0.05). At 95% confidence, any pre-committed single check should sit near 5%, wherever you put it. Watch what peeking does to that.

Run the simulation to see how often random noise crosses the significance threshold.
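If you want to reproduce the effect outside the interactive, the sketch below simulates many A/A tests with 14 daily looks. It is a simplified model, not the code behind the widget: it assumes each day adds an independent standard-normal increment to the test statistic, so the z-score at look k is the running sum divided by √k. The function name and defaults are illustrative.

```python
import random
from statistics import NormalDist


def false_positive_rates(n_experiments=5000, n_looks=14, alpha=0.05, seed=7):
    """Return (peeking_rate, single_check_rate) for simulated A/A tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    single_hits = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        crossed_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)   # one more day of pure noise
            z = running_sum / look ** 0.5        # z-score on all data so far
            if abs(z) > z_crit:
                crossed_any_look = True          # a peeker would have stopped here
        peeking_hits += crossed_any_look
        single_hits += abs(z) > z_crit           # only the final, pre-committed look
    return peeking_hits / n_experiments, single_hits / n_experiments


peeking_rate, single_rate = false_positive_rates()
print(f"peeking daily: {peeking_rate:.1%}, single check: {single_rate:.1%}")
```

Under this model the single pre-committed check stays near 5%, while stopping at the first significant daily look lands several times higher.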

PM Insight

Pre-commit to your sample size before launching. Check results exactly once, when the predetermined sample is reached. If your tooling supports sequential testing or Bayesian experimentation (e.g. Optimizely, Eppo), you can check earlier — but only if the framework is designed for it. Dashboard p-values are not.

So when can you check?

The intuition falls out of the simulation directly. The dashed vertical line on the chart above is your "I'll check at exactly this n" line — drag the slider and watch the "single check at n = X" rate stay near 5% no matter where you put it. Commit to a spot in advance, only look there, and you keep the 5% you signed up for. The "Peeking" rate is what happens when you instead slide the line freely until you find a moment where the data happens to cross significance — cherry-picking the most flattering spot.

Which makes the rule simple: pick your sample size first, don't look until you reach it, then check once. The sample-size formula isn't really telling you "how much data to collect" — it's telling you the one moment at which it's statistically safe to look. Everything before that is noise; everything after is fine but wasteful.

Two pre-committed alternatives let you check sooner without breaking the math. Sequential testing frameworks (Optimizely, Eppo, group-sequential designs) apply adjusted thresholds that compensate for repeated checks — but only if you're using a tool designed for it, not dashboard p-values. Pre-registered interim analyses let you check at fixed waypoints (say 50% and 100% of sample) with stricter thresholds at the interim. Same principle in both: every check is decided before the data arrives, never in response to what the data is doing.
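As a rough illustration of a pre-registered interim analysis, the sketch below splits the 5% error budget evenly across two planned looks (50% and 100% of the sample) using a Bonferroni correction. This is a deliberately simple, conservative stand-in chosen for the example; it is not what Optimizely, Eppo, or formal group-sequential designs use, and those typically apply different boundaries. The simulation uses the same simplified noise model as a random walk, and all names are illustrative.

```python
import random
from statistics import NormalDist


def interim_false_positive_rate(look_fractions=(0.5, 1.0), alpha=0.05,
                                n_experiments=10000, seed=11):
    """Return (z threshold per look, overall false positive rate) for A/A tests."""
    rng = random.Random(seed)
    per_look_alpha = alpha / len(look_fractions)       # Bonferroni split
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    hits = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        seen = 0.0
        for fraction in look_fractions:
            # Data collected since the last look adds noise with variance (fraction - seen).
            running_sum += rng.gauss(0.0, (fraction - seen) ** 0.5)
            seen = fraction
            if abs(running_sum / fraction ** 0.5) > z_crit:
                hits += 1      # declared a "win" at this planned look
                break
    return z_crit, hits / n_experiments


threshold, overall_rate = interim_false_positive_rate()
print(f"z threshold per look: {threshold:.2f}, overall false positives: {overall_rate:.1%}")
```

With two looks the per-look threshold rises from about 1.96 to about 2.24, and the overall false positive rate stays at or below the 5% you committed to. The cost is that each individual look needs stronger evidence.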

Validate your setup first: A/A tests

Before running a high-stakes experiment, run an A/A test — both groups get exactly the same experience, with no change at all. If your infrastructure is working correctly, you should see no significant result. If you do, something is broken: random assignment, event logging, or the analysis pipeline itself.

Think of it as a calibration check. The simulation above is essentially a batch of A/A tests: notice how often random noise alone produces p < 0.05. In a real A/A test checked once at the pre-committed sample size, you'd expect roughly 5% of runs to cross that threshold by chance. If you're seeing 20%, your setup has a bug, not a real effect.

When to run an A/A test

✓ Before your first experiment on a new experimentation platform
✓ Before a high-stakes test where a false positive would be costly
✓ After a significant change to your logging or assignment infrastructure
✓ Any time you've seen unexplained SRMs or suspiciously significant results


Other experiment pitfalls PMs need to know

Novelty effect

Users sometimes engage more with something simply because it's new — not because it's better. If your experiment runs for only a few days, you might be measuring novelty, not value. For features users interact with repeatedly, run long enough to see behavior stabilize.

Network effects and spillover

If users in your control and treatment groups interact with each other (social features, marketplaces, collaborative tools), the treatment leaks. Users in the control see behavior from treatment users and change. This violates the independence assumption experiments require. Clustering by team, household, or geography is often the fix.

Multiple comparisons

Testing 10 metrics simultaneously? Even with no real effect, there's roughly a 40% chance that at least one of them shows significance at the 5% level by random chance alone (assuming the metrics are independent). The more things you test, the more false positives you generate. Pre-declare your primary metric and treat everything else as exploratory, not conclusive.
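The arithmetic behind multiple comparisons is short enough to check directly. This sketch assumes the metrics are independent and each is tested at the 5% level; real metrics are usually correlated, so treat the numbers as a guide. The Bonferroni threshold is shown as one common, conservative correction, not as a recommendation from this chapter.

```python
def chance_of_any_false_positive(n_metrics, alpha=0.05):
    """Probability that at least one of n independent null metrics looks significant."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_threshold(n_metrics, alpha=0.05):
    """Per-metric p-value cutoff that keeps the overall error rate at or below alpha."""
    return alpha / n_metrics


for k in (1, 5, 10, 20):
    expected_false_positives = k * 0.05
    print(k, round(chance_of_any_false_positive(k), 3), round(expected_false_positives, 2))
```

For 10 metrics the chance of at least one spurious "win" is about 40%, and the expected number of false positives is 0.5. It takes about 20 metrics before you expect one on average.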

SRM — Sample Ratio Mismatch

You intended a 50/50 split. Small deviations are expected from randomness — but if your control has 10,800 users and treatment has 9,200, that gap is too large to be plausibly random. That's a Sample Ratio Mismatch (SRM) — a sign that something went wrong in how users were assigned or how events were logged.

Ask your analyst to verify the split is close to what you intended. They'll run a quick statistical check comparing observed group sizes to expected ones — if the difference is unlikely to be chance, you have an SRM. The cause is usually a bug in the assignment logic, a filtering step that's only applied to one group, or an event logging error. Until it's resolved, the two groups aren't comparable and the results can't be trusted — regardless of what the p-value says.
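The "quick statistical check" is typically a chi-square goodness-of-fit test on the group sizes. Below is a minimal standard-library sketch; the function name is illustrative, and the 0.001 alert threshold is a convention many teams use rather than something this chapter specifies.

```python
from math import erfc, sqrt


def srm_check(control_users, treatment_users, expected_control_share=0.5):
    """Chi-square test (1 degree of freedom) of observed vs intended split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


SRM_ALERT_THRESHOLD = 0.001  # assumed convention, adjust to your team's policy

for control, treatment in [(10_800, 9_200), (10_050, 9_950)]:
    chi2, p = srm_check(control, treatment)
    flag = "SRM" if p < SRM_ALERT_THRESHOLD else "ok"
    print(control, treatment, round(chi2, 2), f"{p:.3g}", flag)
```

The 10,800 vs 9,200 split from the example gives a chi-square of 128 and a vanishingly small p-value, so it is flagged. A 10,050 vs 9,950 split is well within what randomness produces.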

Before you ship a "winner"

✓ Did we reach the pre-specified sample size?
✓ Did we check for sample ratio mismatch?
✓ Did any guardrail metrics move negatively?
✓ Is the effect size large enough to actually matter?
✓ Have we run long enough to rule out novelty effect?


When not to A/B test

Experimentation is powerful but not always appropriate. Skip it when:


When user-level randomization breaks: switchback experiments

Standard A/B tests assume you can split users into independent groups — what one user sees doesn't affect what another user experiences. That assumption breaks for anything that touches shared supply, pricing, or system-wide logic.

Consider a ride-sharing dispatch algorithm. If you show treatment riders a new matching logic while control riders use the old one, they're competing for the same drivers. The treatment is leaking into the control group, so the comparison is biased: your experiment no longer measures the algorithm's true effect.

A switchback experiment solves this by randomizing over time, space, or both, instead of over users. The entire system alternates between control and treatment on a schedule: odd hours run the new algorithm, even hours run the old one. For products with strong geographic separation (a ride-share network in NYC doesn't compete for drivers with one in SF), you can also randomize across cities or zones, giving you several parallel switchbacks at once. Teams often combine the two (NYC odd hours, NYC even hours, SF odd hours, SF even hours): four cells, with each city alternating between both variants and neither city contaminating the other.
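The odd-hour/even-hour schedule described above can be sketched in a few lines. The function names and the cell key format are hypothetical, and a production system might randomize which periods receive treatment instead of using a fixed alternation.

```python
from datetime import datetime


def switchback_variant(ts: datetime) -> str:
    """Odd hours run the new algorithm, even hours run the old one."""
    return "treatment" if ts.hour % 2 == 1 else "control"


def cell_key(region: str, ts: datetime) -> tuple:
    """Identify the region x period cell that a request belongs to."""
    return (region, ts.date().isoformat(), ts.hour)


request_time = datetime(2024, 1, 1, 13, 30)
print(cell_key("NYC", request_time), switchback_variant(request_time))
```

Every request in the same region and hour gets the same variant, so riders in that cell never compete against riders on the other algorithm.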

When switchback experiments are the right tool

Marketplace algorithms — ranking, pricing, or matching logic where showing treatment to 50% of users changes the experience for the other 50%.

Shared infrastructure — backend changes that affect all traffic simultaneously (caching strategies, load balancing, ML serving latency).

Supply-side changes — anything that affects inventory, driver supply, or seller behavior in a two-sided market.

How a switchback gets analyzed

A standard A/B test treats each user as one observational unit. A switchback treats each switch period (or each region × period bucket) as one observational unit — every user inside that bucket gets pooled into a single aggregated metric for that cell. So you'll typically have far fewer observational units than a user-level test: maybe a few hundred period-cells over a few weeks, versus tens of thousands of users.

The compensation is that each unit has much lower variance, because aggregating across many users averages out per-user noise within the cell. Your DS team is hunting for the sweet spot: shorter periods give you more observational units (more statistical power) but increase carryover bias from one cell into the next; longer periods give cleaner units but fewer of them. The right interval depends on how quickly the system reaches a new equilibrium after a switch.

You don't need to pick the interval — but you should understand why your DS team is optimizing for it. Asking "what's the unit of analysis here, and how many do we have?" gets you straight to whether the experiment is even powered to detect the effect you care about.
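To see what "unit of analysis" means in practice, the sketch below pools user-level outcomes into one number per cell and then compares variants across cells. It is a simplified illustration with made-up events, not a full switchback analysis: it ignores carryover and time-of-day effects, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, variance


def cell_level_means(events):
    """events: (cell_id, variant, converted) per session -> {variant: [cell means]}."""
    outcomes_by_cell = defaultdict(list)
    for cell_id, variant, converted in events:
        outcomes_by_cell[(cell_id, variant)].append(converted)
    means_by_variant = defaultdict(list)
    for (_cell_id, variant), outcomes in outcomes_by_cell.items():
        means_by_variant[variant].append(mean(outcomes))  # one number per cell
    return dict(means_by_variant)


def difference_with_standard_error(cell_means):
    """Compare variants using cells, not users, as the observational units."""
    t, c = cell_means["treatment"], cell_means["control"]
    diff = mean(t) - mean(c)
    se = (variance(t) / len(t) + variance(c) / len(c)) ** 0.5
    return diff, se, len(t) + len(c)


# Made-up toy data: four hourly cells in one city, two sessions each.
toy_events = [
    ("nyc-h0", "control", 0), ("nyc-h0", "control", 1),
    ("nyc-h1", "treatment", 1), ("nyc-h1", "treatment", 1),
    ("nyc-h2", "control", 0), ("nyc-h2", "control", 0),
    ("nyc-h3", "treatment", 1), ("nyc-h3", "treatment", 0),
]
diff, se, n_units = difference_with_standard_error(cell_level_means(toy_events))
print(f"difference: {diff:.2f}, standard error: {se:.2f}, units: {n_units}")
```

Eight sessions collapse into four observational units here, which is why the standard error is driven by the number of cells and not by the number of users.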

PM Insight

When your DS team proposes a switchback, it's because standard user-level randomization would contaminate the control group and give you a meaningless result. The tradeoff is real: switchbacks require more calendar time (many alternating periods to average out time-of-day and day-of-week effects), and carryover behavior between periods can still bias results if the intervals are too short. But for system-level changes, it's the right tool — a bad A/B test is worse than no test.


PM Playbook — Questions to ask

