Chapter 04

A/B Testing & Experimentation

How experiments actually work, how to size them correctly, and the mistakes that silently invalidate results.

⏱ 20 min read 📊 Includes 2 interactives

Why we experiment at all

You've redesigned the checkout flow. Your designer is confident it's better. Your engineer is confident too. Usability research identified real friction on the payment form, and the new design fixes it. You formed a prior in Chapter 3: ~4.6% conversion, up from the current 4.2%. So why not just ship it?

Because everyone is always confident. And they're often wrong. Intuition about how users will behave is systematically biased by what we know, what we built, and what we want to be true. Experiments let reality vote.

An A/B test splits your users into two groups — a control (the current experience) and a treatment (the new one). Because the split is random, any difference in outcomes you observe is likely caused by the change, not by pre-existing differences between the groups.
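The random split is usually implemented as a deterministic salted hash rather than a coin flip, so a returning user always sees the same variant. A minimal sketch (the function name and salt are hypothetical, not from any specific platform):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Deterministically bucket a user into control or treatment.

    The salted hash behaves like a random 50/50 split across users,
    but the same user always maps to the same variant.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"
```

Because assignment depends only on the user ID and the experiment's salt, no assignment table is needed, and two experiments with different salts split users independently.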

PM Insight

The power of an A/B test isn't the statistics — it's the randomization. Random assignment is what lets you claim causation instead of just correlation. Without it, you have a before/after comparison, which is much weaker.

An experiment is the structured way to run the belief-update loop from Chapter 3. You start with a prior (the hypothesis), define what evidence would change your decision (the MDE), collect data under controlled conditions, and update. The structure prevents the two failure modes Bayesian reasoning guards against: overreacting to early noise, and rationalizing results after the fact.


The anatomy of an experiment

1. Hypothesis

Before you run anything, write down what you believe will happen, why, and by how much. Direction alone isn't enough — you need a quantified prediction.

WEAK
"We believe the new checkout design will improve conversion."

BETTER
"We believe the redesigned checkout will increase conversion by ~10% relative (4.2% → ~4.6%), because usability research shows users abandoning at the payment form step, which this design directly addresses."

The quantified estimate isn't a guess — it's the prior you built in Chapter 3: the last checkout redesign lifted conversion by 8%, benchmarks suggest 5–15% is achievable, and research pointed to a specific, fixable friction point. Combine those signals and 10% is a defensible estimate — not optimism, not false modesty.

This matters because your hypothesis feeds directly into your MDE (below). If you expect 10–15%, an MDE of 8% is sensible. If your hypothesis were only "I think it'll go up a bit," you'd have no basis for setting an MDE at all — and you'd likely run an experiment too small to detect anything real, or too large to be worth the time.

PM Insight

Your hypothesis and your MDE need to be consistent. If you pre-commit to an MDE of 10% but your prior is only 3%, something is off — either you're understating your expected effect, or the experiment isn't worth running. Writing down a quantified hypothesis forces that inconsistency into the open before you spend weeks running a test.

2. Primary metric

One metric you're trying to move. Not three. Not five. One. Adding more metrics increases the chance you'll find something significant by accident and declare victory on a measurement that doesn't matter.

For the checkout redesign: checkout conversion rate. That's it. Everything else is secondary.

3. Guardrail metrics

Metrics you're not trying to move, but would care deeply about if they did. These are your safety net — a win on one metric that breaks another isn't a win.

For the checkout redesign: revenue per session (conversion could go up while order value drops), page load time (a slower checkout might convert less at scale), and payment error rate (a new form could introduce new failure modes).

4. Minimum detectable effect (MDE)

The smallest improvement that would actually matter for your business — set from business logic, not statistical convenience. If a 2% relative lift in checkout conversion wouldn't change your roadmap, don't power the experiment to detect it.

For the checkout redesign: you expect ~10% relative lift and you pre-committed in Chapter 3 to treating anything under 5% as not worth shipping. So your MDE is 5%. That tells you exactly how large a sample you need.

PM Insight

MDE is a business decision, not a statistics question. Ask: "If the real effect is X, would we ship this?" Work backward from there. Setting a realistic MDE will save you from running experiments that are either impossibly long or detect effects too small to act on.


Sample size: the question everyone skips

Most experiment mistakes happen before the experiment starts. Teams launch tests without checking how many users they actually need — then stop when the dashboard shows significance, or when they get impatient, whichever comes first.

Sample size depends on three things: your baseline conversion rate, the minimum detectable effect, and the statistical confidence and power you require.

The lower the baseline rate and the smaller the effect you're trying to detect, the more users you need — often far more than teams expect.

For the checkout redesign: baseline 4.2%, MDE 5% relative lift, and say 3,000 users hit checkout daily. Try those numbers in the calculator — you'll see why teams that launch on gut feel and check after a week almost always stop too early.

Interactive — Sample Size Calculator

Set your baseline rate, the minimum lift you care about, and your daily traffic. See how long your experiment needs to run. (Assumes 95% confidence, 80% power.)

Outputs: sample size per variant · total users · time to run
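The calculation behind such a calculator can be sketched with the standard two-proportion formula, using only the Python standard library. The numbers below mirror the chapter's checkout example; the function name is illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size for a two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

n = sample_size_per_variant(0.042, 0.05)   # baseline 4.2%, MDE 5% relative
days = math.ceil(2 * n / 3000)             # 3,000 checkout users per day
```

With these inputs, the per-variant sample size comes out to roughly 147,000 — around three months at 3,000 checkout users per day. That's why a team that "checks after a week" is nowhere near done.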

The peeking problem

You launch the checkout redesign test. After three days you check the dashboard. The new flow is winning — p = 0.04, just under 0.05. You call it. Ship it.

This is one of the most common and costly mistakes in product experimentation. It's called peeking, and it silently inflates your false positive rate.

Here's why: when you run a statistical test repeatedly as data comes in, you're not running one test — you're running many. Each time you check, there's a chance the random variation in your data temporarily crosses the significance threshold. If you stop the moment that happens, you'll declare winners far more often than your 5% false positive rate implies.

The simulation below shows what happens when there's no real difference between A and B (a true A/A test). Watch how often random noise alone produces a "significant" result when you allow peeking.

Interactive — The Peeking Problem

Each run simulates 10 A/A experiments (no real difference). Red lines are tests that crossed the significance threshold somewhere along the way — all false positives. At 95% confidence you'd expect about 5%. Watch what peeking does to that number.

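The same A/A simulation can be sketched in a few lines of Python. The parameters (5% base rate, 10 interim checks of 1,000 users per arm) are illustrative choices, not the interactive's exact settings:

```python
import math
import random
from statistics import NormalDist

def z_test_p(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for a two-proportion z-test."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (success_a / n_a - success_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
SIMS, BATCH, PEEKS, P = 200, 1000, 10, 0.05  # A/A: both arms share the same rate P
peek_hits = final_hits = 0
for _ in range(SIMS):
    ca = cb = 0
    significant_early = False
    for peek in range(1, PEEKS + 1):
        ca += sum(random.random() < P for _ in range(BATCH))
        cb += sum(random.random() < P for _ in range(BATCH))
        if z_test_p(ca, peek * BATCH, cb, peek * BATCH) < 0.05:
            significant_early = True   # a peeking team would stop and "ship" here
    peek_hits += significant_early
    # the disciplined team checks exactly once, at the final sample:
    final_hits += z_test_p(ca, PEEKS * BATCH, cb, PEEKS * BATCH) < 0.05
```

With settings like these, the stop-at-first-significance rate typically lands in the 10–20% range, while the single final check stays near the promised 5% — same data, same test, very different error rate.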

PM Insight

Pre-commit to your sample size before launching. Check results exactly once, when the predetermined sample is reached. If your tooling supports sequential testing or Bayesian experimentation (e.g. Optimizely, Eppo), you can check earlier — but only if the framework is designed for it. Dashboard p-values are not.

Validate your setup first: A/A tests

Before running a high-stakes experiment, run an A/A test — both groups get exactly the same experience, with no change at all. If your infrastructure is working correctly, you should see no significant result. If you do, something is broken: random assignment, event logging, or the analysis pipeline itself.

Think of it as a calibration check. The simulation above is essentially an A/A test — notice how often random noise alone produces p < 0.05. In a real A/A, you'd expect roughly 5% of runs to cross that threshold by chance. If you're seeing 20%, your setup has a bug, not a real effect.

When to run an A/A test

✓ Before your first experiment on a new experimentation platform
✓ Before a high-stakes test where a false positive would be costly
✓ After a significant change to your logging or assignment infrastructure
✓ Any time you've seen unexplained sample ratio mismatches (SRMs) or suspiciously significant results


Other experiment pitfalls PMs need to know

Novelty effect

Users sometimes engage more with something simply because it's new — not because it's better. If your experiment runs for only a few days, you might be measuring novelty, not value. For features users interact with repeatedly, run long enough to see behavior stabilize.

Network effects and spillover

If users in your control and treatment groups interact with each other (social features, marketplaces, collaborative tools), the treatment leaks. Users in the control see behavior from treatment users and change. This violates the independence assumption experiments require. Clustering by team, household, or geography is often the fix.
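One way to implement the clustering fix is to randomize on the cluster key instead of the individual user. A hypothetical sketch, assuming each user record carries a `team_id`:

```python
import hashlib

def assign_by_cluster(cluster_id: str, experiment_salt: str) -> str:
    """Everyone in the same team/household/geo gets the same variant,
    so treated and control users never interact within a cluster."""
    digest = hashlib.sha256(f"{experiment_salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Two users on the same team always land in the same group:
# assign_by_cluster(user.team_id, "collab_exp") is identical for both.
```

The tradeoff: your effective sample size is the number of clusters, not the number of users, so the analysis must account for that or the experiment will look better powered than it is.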

Multiple comparisons

Testing 10 metrics simultaneously? By random chance alone, you'd expect about one of them to show significance even with no real effect. The more things you test, the more false positives you generate. Pre-declare your primary metric and treat everything else as exploratory, not conclusive.
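The "about one of them" claim follows from the family-wise error rate, which you can check in two lines — along with the Bonferroni correction some teams apply to secondary metrics:

```python
# Probability of at least one false positive across k independent tests at alpha = 0.05
k, alpha = 10, 0.05
fwer = 1 - (1 - alpha) ** k           # ~0.40: a 40% chance of at least one spurious "win"
expected_false_positives = k * alpha  # 0.5: roughly one spurious hit every other run

# Bonferroni correction: test each metric at alpha / k
# to keep the family-wise rate near the original alpha
corrected_alpha = alpha / k           # 0.005 per metric
```

The correction is conservative (it makes real effects harder to detect), which is another argument for one pre-declared primary metric rather than ten corrected ones.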

SRM — Sample Ratio Mismatch

You intended a 50/50 split. Small deviations are expected from randomness — but if your control has 10,800 users and treatment has 9,200, that gap is too large to be plausibly random. That's a Sample Ratio Mismatch (SRM) — a sign that something went wrong in how users were assigned or how events were logged.

Ask your analyst to verify the split is close to what you intended. They'll run a quick statistical check comparing observed group sizes to expected ones — if the difference is unlikely to be chance, you have an SRM. The cause is usually a bug in the assignment logic, a filtering step that's only applied to one group, or an event logging error. Until it's resolved, the two groups aren't comparable and the results can't be trusted — regardless of what the p-value says.
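The analyst's check can be sketched as a z-test on the observed split against the intended ratio (normal approximation to the binomial; the function name is illustrative):

```python
import math
from statistics import NormalDist

def srm_p_value(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """Two-sided p-value for: 'is this split consistent with the intended ratio?'"""
    n = n_control + n_treatment
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (n_control / n - expected_ratio) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_suspicious = srm_p_value(10_800, 9_200)   # the chapter's 10,800 vs 9,200 split
p_normal = srm_p_value(10_050, 9_950)       # small deviation, plausibly random
```

For 10,800 vs 9,200 the p-value is vanishingly small — that split essentially never happens by chance at this scale — while 10,050 vs 9,950 is entirely unremarkable.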

Before you ship a "winner"

✓ Did we reach the pre-specified sample size?
✓ Did we check for sample ratio mismatch?
✓ Did any guardrail metrics move negatively?
✓ Is the effect size large enough to actually matter?
✓ Have we run long enough to rule out novelty effect?


When not to A/B test

Experimentation is powerful but not always appropriate. Skip it when:


When user-level randomization breaks: switchback experiments

Standard A/B tests assume you can split users into independent groups — what one user sees doesn't affect what another user experiences. That assumption breaks for anything that touches shared supply, pricing, or system-wide logic.

Consider a ride-sharing dispatch algorithm. If you show treatment riders a new matching logic while control riders use the old one, they're competing for the same drivers. The treatment is leaking into the control group, so your experiment compares two contaminated groups and can't recover the algorithm's true effect.

A switchback experiment solves this by randomizing over time instead of users. The entire system alternates between control and treatment on a schedule — for example, odd hours run the new algorithm, even hours run the old one. Because the whole system switches, there's no contamination between groups.
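A switchback schedule can be sketched as a deterministic function of time. This hypothetical helper randomizes each hour-long period with a salted hash rather than strict odd/even alternation, so the treatment schedule doesn't line up with time-of-day patterns:

```python
import hashlib
from datetime import datetime, timezone

def variant_for_period(ts: datetime, experiment_salt: str, period_hours: int = 1) -> str:
    """Every request in the same time bucket gets the same variant, system-wide."""
    bucket = int(ts.timestamp()) // (period_hours * 3600)
    digest = hashlib.sha256(f"{experiment_salt}:{bucket}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Every request between 14:00 and 15:00 UTC resolves to the same variant:
ts = datetime(2024, 3, 1, 14, 30, tzinfo=timezone.utc)
variant = variant_for_period(ts, "dispatch_v2")
```

The analysis then compares treatment periods against control periods, which is why switchbacks need many alternating intervals to average out hour-of-day and day-of-week effects.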

When switchback experiments are the right tool

Marketplace algorithms — ranking, pricing, or matching logic where showing treatment to 50% of users changes the experience for the other 50%.

Shared infrastructure — backend changes that affect all traffic simultaneously (caching strategies, load balancing, ML serving latency).

Supply-side changes — anything that affects inventory, driver supply, or seller behavior in a two-sided market.

PM Insight

When your DS team proposes a switchback, it's because standard user-level randomization would contaminate the control group and give you a meaningless result. The tradeoff is real: switchbacks require more calendar time (many alternating periods to average out time-of-day and day-of-week effects), and carryover behavior between periods can still bias results if the intervals are too short. But for system-level changes, it's the right tool — a bad A/B test is worse than no test.


PM Playbook — Questions to ask


4 questions