Chapter 02
Stats Foundations
Why averages lie, what distributions actually tell you, and the numbers that matter most in product decisions.
The problem with "average"
Say you're a PM at a music app. Your analyst reports: "Users listen to an average of 27 minutes of music per day." Sounds healthy. So you ship a feature targeting that "average user."
But look closer. Maybe 60% of users listen for 2–5 minutes (background listening while commuting), and 15% of power users listen for 2–3 hours. The "27-minute user" barely exists — it's a mathematical artifact of averaging two very different groups.
You just designed for a ghost.
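You can see the ghost in a few lines of NumPy. This is a sketch with made-up numbers loosely matching the scenario above (60% light listeners, 15% power users, the rest in between) — the point is that almost nobody sits anywhere near the mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal population (illustrative, not real data):
light = rng.uniform(2, 5, 6000)       # 60%: background/commute listeners
middle = rng.uniform(5, 20, 2500)     # 25%: everyone else
power = rng.uniform(120, 180, 1500)   # 15%: power users
minutes = np.concatenate([light, middle, power])

mean = minutes.mean()  # lands near ~27 minutes
# Share of users whose actual listening is within 5 minutes of the mean:
near_mean = np.mean(np.abs(minutes - mean) < 5)
print(f"mean = {mean:.0f} min, users within ±5 min of mean: {near_mean:.0%}")
```

The weighted mean lands between the two clusters, in a region where essentially no user actually lives.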
PM Insight
Whenever you see an average, ask: what does the distribution look like? Is everyone clustered near the average, or are there distinct groups pulling it in different directions? The answer changes what you build.
What a distribution actually is
A distribution is just a picture of how often each value appears in your data. Instead of collapsing everything into one number, it shows the full shape of behavior.
The most common shape is the normal distribution — the classic bell curve. Most values cluster near the middle, with fewer values at the extremes. Heights, daily step counts, and measurement errors all tend to look roughly normal.
But many things in product data don't. Revenue, virality, content engagement — these are often skewed, with a long tail of outliers that drag the average far from where most users actually are.
Interactive — Distribution Explorer
Drag the sliders to shift the mean and change the spread. Watch how the shape changes — and notice when the average stops being useful.
Mean, median, and mode — when each matters
Mean (average)
Add everything up, divide by count. Fast and familiar — but sensitive to extreme values. One power user spending $10,000 can make an average revenue metric look very different from what most users actually do.
Use it when: the distribution is roughly symmetric and outliers aren't distorting it.
Median
The middle value when everything is sorted. Half of users are above it, half below. Much more robust to outliers and skewed distributions.
Use it when: you care about the "typical" user, not the mathematical average. Revenue, session length, and load times are often better reported as medians.
PM Insight
You'll often see load times or error rates reported as P50, P75, P95, P99 — these are percentiles. P50 is the median (50% of values are below it). P95 means 95% of values are below that number — so if your app's P95 load time is 8 seconds, the slowest 5% of users wait 8 seconds or more. For a large product, that's millions of people. Averages hide this completely.
Mode
The most frequently occurring value. Useful for categorical data (most common OS, most popular plan tier) but rarely the right tool for continuous metrics.
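All three statistics are in Python's standard library. A minimal sketch with hypothetical data — note how one power user moves the mean but not the median, and how mode fits categorical data:

```python
import statistics

sessions_min = [3, 4, 5, 5, 6, 7, 180]  # one power user skews the mean
plan_tiers = ["free", "free", "pro", "free", "team", "pro", "free"]

print(statistics.mean(sessions_min))    # 30 — pulled up by the 180-min outlier
print(statistics.median(sessions_min))  # 5 — the typical session
print(statistics.mode(plan_tiers))      # "free" — most common plan tier
```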
Spread: the number that changes everything
Two products can have the same average NPS of 32 — one where most users score it 30–35, and one where half score it 10 and half score it 55. Same mean, completely different products. The difference is variance — how spread out the values are.
Standard deviation is the most common measure of spread. Think of it as the "typical distance" a value sits from the average. You'll sometimes see it written as the Greek letter σ (sigma) — "3 sigma" just means 3 standard deviations away from the mean. In a normal distribution:
- 68% of values fall within ±1 standard deviation of the mean
- 95% fall within ±2 standard deviations
- 99.7% fall within ±3 standard deviations — which is why "3 sigma events" are considered very rare
But most product metrics aren't normally distributed, so apply this framework with care.
PM Insight
When your data team reports a metric changed, ask about the standard deviation too. A 5% increase in a metric with very high variance might be pure noise. The same 5% increase in a stable, low-variance metric is worth paying attention to.
Skewed distributions and long tails
Many product metrics are right-skewed — most values are small, but a small number of very large values pull the mean to the right. Revenue, viral shares, support tickets — all tend to look like this.
In a right-skewed distribution, the mean gets pulled above the median. The gap between them tells you how much outliers are distorting the average.
When you hear "our average revenue per user is $45," ask: what's the median? If the median is $8, your "average user" is mostly a fiction invented by a small number of heavy spenders.
Rule of thumb
Mean > Median → right-skewed, outliers pulling average up. Report median.
Mean ≈ Median → roughly symmetric. Mean is fine.
Mean < Median → left-skewed, unusually low values dragging average down. Report median.
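The rule of thumb is easy to see on simulated revenue data. A sketch using lognormal draws (a common model for spend distributions; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical right-skewed revenue: most users spend little,
# a few heavy spenders dominate the total.
revenue = rng.lognormal(mean=2.0, sigma=1.2, size=50_000)

mean, median = revenue.mean(), np.median(revenue)
print(f"mean=${mean:.0f}, median=${median:.0f}")  # mean well above median
```

Mean far above median → right-skewed → report the median.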
Concrete example: why median holds up under skew
Here are the load times (in seconds) for 10 users opening your app:
User 1: 0.9 seconds
User 2: 1.0 seconds
User 3: 1.1 seconds
User 4: 1.2 seconds
User 5: 1.3 seconds
User 6: 1.4 seconds
User 7: 1.6 seconds
User 8: 1.9 seconds
User 9: 4.8 seconds (slow connection)
User 10: 38.1 seconds (something went very wrong)
The mean (5.3s) makes your app look sluggish for everyone — two outliers dragged it up from 1.3s. If you reported this in a review, you'd likely kick off a performance sprint that 80% of your users don't need.
The median (1.35s) tells the truth: the typical user loads fast. But it hides a real problem — User 10's 38-second experience is genuinely broken. Median alone would let that slip past.
The P90 (38.1s) catches it. You need all three numbers to get the full picture: median for typical experience, a high percentile for tail pain, and you can safely ignore the mean.
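The same example as a quick NumPy sketch. The ten load times are illustrative values consistent with the summary statistics above; note that on a sample this small, percentile results depend on the interpolation method — `method="higher"` picks the next observed value, which is how P90 lands on the worst user:

```python
import numpy as np

# Ten illustrative load times (seconds), matching the example above.
load_times = np.array([0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.6, 1.9, 4.8, 38.1])

mean = load_times.mean()                              # ~5.3s, dragged up by two outliers
median = np.median(load_times)                        # 1.35s, the typical user
p90 = np.percentile(load_times, 90, method="higher")  # 38.1s, catches the broken tail
print(f"mean={mean:.1f}s  median={median:.2f}s  P90={p90:.1f}s")
```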
Interactive — Outlier impact on mean vs percentiles
Each dot is a user's load time. Add slow users and watch how each statistic responds. Notice which ones move and which ones hold steady.
Which percentile should you use?
Median (P50) is the right default for skewed data — but it answers only one question: what does the typical user experience? Different product questions need different percentiles.
| Percentile | What it answers | Use for |
|---|---|---|
| P25 | What does the bottom quarter experience? | Spotting disengaged users, catching low-end performance issues |
| P50 (median) | What does the typical user experience? | Revenue per user, session length, any metric where "average" is misleading |
| P75 | Where does the top quartile start? | Understanding power-user behavior, setting realistic targets |
| P90 / P95 | How bad does it get for the worst 10% / 5%? | Latency, load times, errors — anything where tail pain matters |
| P99 | What's the worst-case experience? | Infrastructure SLAs, catching catastrophic failures, high-volume systems |
| IQR (P75−P25) | How spread out is the middle of the data? | Measuring variability in skewed data — more robust than standard deviation |
The three-number summary for performance metrics
For any latency or load time metric, ask for these three together:
P50 — is the typical experience acceptable?
P90 or P95 — how bad is it for the worst-off users?
P99 — are there catastrophic outliers we need to fix?
A fast P50 paired with a broken P99 means most users are happy but some are having a terrible time. You'd never see that from a mean.
PM Insight
The IQR (interquartile range = P75 minus P25) is a much better measure of spread than standard deviation when your data is skewed. Standard deviation is distorted by outliers for the same reason the mean is. IQR ignores the tails entirely and tells you how much the middle 50% of your users vary — which is usually what you actually want to know.
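You can watch the difference directly. A sketch on simulated skewed spend data (arbitrary parameters): adding a handful of extreme whales sends the standard deviation flying while the IQR barely moves:

```python
import numpy as np

rng = np.random.default_rng(3)
spend = rng.lognormal(mean=1.5, sigma=1.0, size=20_000)  # skewed spend data

p25, p75 = np.percentile(spend, [25, 75])
iqr = p75 - p25
std = spend.std()

# Add ten extreme whales: std jumps, IQR barely moves.
with_whales = np.append(spend, [5_000] * 10)
p25w, p75w = np.percentile(with_whales, [25, 75])
print(f"std: {std:.1f} -> {with_whales.std():.1f}")
print(f"IQR: {iqr:.1f} -> {p75w - p25w:.1f}")
```

The standard deviation is dominated by ten users out of twenty thousand; the IQR still describes the middle 50%.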
PM Playbook — Questions to ask
- What's the median, not just the mean? — especially for revenue, session length, load times
- Can you show me the distribution? — a histogram (a bar chart showing how many users fall into each value range) reveals shape that a single number hides
- What are the p50, p90, p99? — critical for latency, errors, anything where tail behavior matters
- Is this metric skewed? — if so, who are the outliers and should we be designing for them?
- What's the standard deviation? — so you know if a change is larger than normal noise
- Are we segmenting before averaging? — combining different user cohorts (groups such as new vs. returning users, or users who signed up in different months) before averaging is almost always misleading