Chapter 05

Metrics Design

How to build a metrics framework that drives real decisions — and stop optimizing for numbers that don't matter.


The metric that almost killed a product

A growth team at a consumer app was measured on daily active users (DAU). They hit their targets consistently — but the product was slowly dying. Retention was falling. Revenue per user was flat. The team had found many creative ways to drive DAU: aggressive push notifications, dark-pattern re-engagement flows, even counting a user as "active" if they opened a notification without touching the app.

The metric was going up. The business was going down. They had optimized for the measure, not the thing the measure was supposed to represent.

This is Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." It's not a corner case — it's the default outcome of poorly designed metrics systems.

PM Insight

Every metric is a proxy for something you actually care about. The job of metrics design is to choose proxies that are hard to game, move when the real thing moves, and stay stable when the real thing doesn't.


The three-layer metrics framework

Most strong product teams organize their metrics into three layers that answer three different questions:

1. North star metric

The single number that best captures whether your product is delivering value to users and to the business. It's your long-term health indicator. Moving it should feel meaningful, and it shouldn't be easily gamed.

Good north star metrics sit at the intersection of user value and business value. Spotify's is time spent listening. Airbnb's is nights booked. Notion's is weekly active editors (not just viewers). Each captures the core value exchange.

The north star test

Ask: "If this number went up consistently for 6 months, would the business definitely be healthier?" If the answer is "yes, but only if it goes up for the right reasons" — you've found a metric that's gameable and you need guardrails. If yes with no caveats, you have a strong north star.

2. Input metrics

Also called leading indicators or driver metrics. These are the upstream behaviors that, when they move, tend to cause the north star to move. They're more actionable than the north star because teams can directly influence them.

If Spotify's north star is listening time, input metrics might include: new playlist creates per user, podcast starts in first week, search-to-play conversion, discovery feature engagement. Each of these, when it improves, tends to cause listening time to grow.

3. Guardrail metrics

Metrics you're not trying to move, but that you'd care deeply about if they moved against you. They protect against teams optimizing input metrics in harmful ways.

If the podcast team is growing listening time by making it harder to skip episodes, the guardrail is user satisfaction (measured via rating, cancellation rate, or NPS). If the notification team is growing DAU by spamming users, the guardrail is unsubscribe rate.
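The three layers can be captured in a simple structure. As a minimal sketch, with metric names borrowed from the Spotify-style examples above (the exact metrics are illustrative, not prescriptive):

```python
from dataclasses import dataclass, field

@dataclass
class MetricsTree:
    """Three-layer metrics framework: one north star,
    driver metrics beneath it, guardrails around it."""
    north_star: str
    inputs: list[str] = field(default_factory=list)      # upstream drivers teams can move
    guardrails: list[str] = field(default_factory=list)  # must not degrade

# Hypothetical tree for a music-streaming product
tree = MetricsTree(
    north_star="weekly listening hours per user",
    inputs=[
        "new playlist creates per user",
        "search-to-play conversion",
        "podcast starts in first week",
    ],
    guardrails=["notification opt-out rate", "30-day retention"],
)

print(tree.north_star)  # weekly listening hours per user
```

Writing the tree down this explicitly forces the two checks that matter: every input should plausibly cause the north star to move, and every guardrail should catch a way of gaming one of the inputs.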




Leading vs lagging indicators

Lagging indicators measure outcomes — they confirm what already happened. Revenue, churn, NPS. They're important but slow. By the time they move, the cause is often weeks or months in the past.

Leading indicators predict future outcomes — they're correlated with where the lagging metric is heading. They move earlier and give you time to act.

Example: SaaS retention

Lagging: Monthly churn rate — you only know it churned after it churned.
Leading: Feature adoption in week 2, support ticket volume, login frequency in the 30 days before renewal — these predict churn before it happens and give you time to intervene.

Example: Consumer engagement

Lagging: 30-day retention — you know the user left, not why or when.
Leading: Number of core actions in days 1–7, return visits in week 1, connection count (for social products). Users who hit certain leading thresholds in the first week retain at dramatically higher rates.

PM Insight

The most valuable thing you can do with your data team is find your product's leading indicators for retention. Ask: "What do users who are still active at 90 days do differently in their first two weeks?" That analysis will surface your real input metrics — and tell you exactly what new user experience needs to drive.
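One way to run that analysis is to split a cohort by 90-day retention and compare early behavior between the two groups. A plain-Python sketch, with toy numbers and a hypothetical `actions_wk1_2` field standing in for whatever core action your product tracks:

```python
# Toy cohort: first-two-week action counts plus whether the user
# was still active at day 90. Real data would come from your warehouse.
users = [
    {"actions_wk1_2": 14, "retained_90d": True},
    {"actions_wk1_2": 11, "retained_90d": True},
    {"actions_wk1_2": 2,  "retained_90d": False},
    {"actions_wk1_2": 1,  "retained_90d": False},
    {"actions_wk1_2": 9,  "retained_90d": True},
]

def avg_actions(users, retained):
    """Mean first-two-week actions for one cohort."""
    cohort = [u["actions_wk1_2"] for u in users if u["retained_90d"] == retained]
    return sum(cohort) / len(cohort)

retained_avg = avg_actions(users, True)    # (14 + 11 + 9) / 3 ≈ 11.3
churned_avg = avg_actions(users, False)    # (2 + 1) / 2 = 1.5

# A large gap suggests early action count is a leading indicator
# worth turning into an input metric for onboarding.
print(f"retained: {retained_avg:.1f}, churned: {churned_avg:.1f}")
```

In practice you'd run this across many candidate behaviors and keep the ones with the widest, most stable gaps; those become your input metrics.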


Vanity metrics: the ones that feel good but don't decide anything

Vanity metrics are numbers that go up and make people feel good, but don't change any decision you'd make. They're usually easy to move, hard to connect to outcomes, and beloved by stakeholders who want to report wins.

Vanity metric → actionable alternative

  • Total registered users → active users (defined precisely)
  • Page views → pages per session, task completion
  • App downloads → D7 / D30 retention after install
  • Social media followers → engagement rate, DM conversions
  • Emails sent → email open-to-action rate
  • Features shipped → features used, adoption rate

The test for vanity: "If this metric doubled tomorrow, what would we do differently?" If the answer is nothing, it's a vanity metric. Cut it from your review deck.


How teams game metrics — and how to prevent it

Goodhart's Law is automatic. The moment a metric becomes a team's goal, people find ways to hit it that don't require doing the underlying thing. This isn't malice — it's how incentives work. Your job as a PM is to design metrics that are hard to game without doing the real work.

Here are four common gaming scenarios, and what happens when teams optimize for the metric directly:

Measured on: DAU (daily active users)
Gaming: The growth team sends daily push notifications to every user, whether or not the notification is relevant.
Result: DAU goes up. Notification opt-out rate spikes 40%. 90-day retention falls. You've trained users to ignore you.

Measured on: feature adoption %
Gaming: The PM adds a mandatory prompt on login that forces users to interact with the feature once before dismissing it.
Result: The adoption metric hits target. Repeat usage is near zero. The feature looks healthy in reporting; it's dead in practice.

Measured on: support ticket resolution time
Gaming: The support team closes tickets quickly by marking issues as resolved without confirming with the user.
Result: Resolution time falls. Reopen rate triples. CSAT scores drop. You've made the metric look good and the experience worse.

Measured on: number of experiments run
Gaming: The team runs many tiny, low-stakes tests (button colour, copy tweaks) to hit their experiment count target.
Result: Experiment count soars. No high-impact experiments get run because they're riskier and harder to set up. Velocity theater at its finest.

PM Insight

The antidote to gaming is pairing every primary metric with at least one guardrail that would catch the gaming behavior. DAU + notification opt-out rate. Feature adoption + 30-day repeat usage. Resolution time + reopen rate. Each pairing makes it harder to move one without the other noticing.
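That pairing discipline can be made mechanical in a metrics review. A minimal sketch: each primary metric ships with its guardrail, and a check flags any "win" where the guardrail degraded beyond a tolerance. Metric names and thresholds here are invented for illustration:

```python
# Pair each primary metric with the guardrail that would catch its
# gaming behaviour. Guardrails here are "bad if they rise" metrics.
GUARDRAILS = {
    "dau": "notification_opt_out_rate",
    "feature_adoption_rate": "feature_abandon_rate_30d",
    "tickets_resolved_per_day": "ticket_reopen_rate",
}

def suspicious_win(primary, before, after, tolerance=0.05):
    """Flag a primary-metric gain whose paired guardrail worsened
    by more than `tolerance` (absolute) over the same period."""
    guardrail = GUARDRAILS[primary]
    gained = after[primary] > before[primary]
    guardrail_worsened = after[guardrail] - before[guardrail] > tolerance
    return gained and guardrail_worsened

before = {"dau": 120_000, "notification_opt_out_rate": 0.08}
after  = {"dau": 135_000, "notification_opt_out_rate": 0.19}

print(suspicious_win("dau", before, after))  # True: DAU rose, but opt-outs spiked
```

The point isn't the code; it's that the pairing is declared up front, before anyone has a number to defend.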


The HEART framework

Google's HEART framework is a structured way to think about user experience quality at scale. It's not a replacement for a metrics tree, but it's a useful lens for making sure you're not missing entire categories of user health.

H — Happiness: NPS / CSAT, app store rating, post-task satisfaction
E — Engagement: sessions per week, actions per session, core feature usage depth
A — Adoption: new users reaching activation, feature adoption rate, time to first value
R — Retention: D7 / D30 retention, monthly churn, cohort survival curves
T — Task success: task completion rate, error rate, time on task

HEART is most useful early in planning. Running through the five letters will often surface a category of user health you'd otherwise ignore — particularly Happiness and Task Success, which product teams frequently skip in favour of purely quantitative metrics.


Counter metrics: the PM's safety net

A counter metric is a metric that should move in the opposite direction if you're gaming your primary metric. The idea is simple: pair every metric you're optimizing with one that catches the failure mode.

  • Optimizing for DAU. Gaming risk: notification spam, dark patterns. Counter metric: notification opt-out rate, D30 retention.
  • Optimizing for activation rate. Gaming risk: forced onboarding, mandatory steps. Counter metric: D7 retention of activated users.
  • Optimizing for revenue per user. Gaming risk: aggressive upsells, dark pricing patterns. Counter metric: churn rate, NPS, refund rate.
  • Optimizing for session length. Gaming risk: friction-heavy flows that are hard to exit. Counter metric: task completion rate, frustration signals.
  • Optimizing for click-through rate. Gaming risk: clickbait, misleading previews. Counter metric: post-click engagement, bounce rate.

Defining "active" — the decision most teams skip

"Active user" is the most-cited and least-defined metric in product. Before you can track it, you need to answer:

A project management tool with one login per week is highly engaged. A news app with one login per week is churning. "Active" means different things for different products — and using the wrong definition produces metrics that look healthy and aren't.

Rule of thumb: usage frequency dictates window

Daily utility products (chat, email, calendar): DAU / WAU
Weekly workflow tools (project management, docs): WAU
Monthly or event-driven (tax software, travel): MAU or cohort analysis
The window should match how often a healthy user would naturally return.
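The rule of thumb translates directly into how "active" gets computed: count a user if they performed a qualifying action within the product's natural return window. A sketch, with the event shape, product types, and window lengths as illustrative assumptions:

```python
from datetime import date, timedelta

# Natural return window per product type, in days (illustrative)
WINDOWS = {"chat": 1, "project_mgmt": 7, "tax_software": 30}

def active_users(events, product_type, as_of):
    """Users with at least one qualifying event inside the window."""
    window = timedelta(days=WINDOWS[product_type])
    return {user for user, day in events
            if day <= as_of and as_of - day <= window}

events = [
    ("ana", date(2024, 3, 10)),
    ("ben", date(2024, 3, 4)),
    ("cam", date(2024, 2, 1)),
]

# Weekly workflow tool: ana and ben fall inside the 7-day window, cam does not
print(active_users(events, "project_mgmt", as_of=date(2024, 3, 10)))
```

Note that the same three users give a different answer under a different window: as a daily-utility product, only ana would count. The definition, not the data, drives the number.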


PM Playbook — Questions to ask
