Chapter 09
Evaluating ML Models
Translating model metrics into business decisions — what to measure, what it costs when you're wrong, and how to know when a model is degrading.
Evaluation is a product problem
Model evaluation is usually framed as a technical exercise: compute accuracy, AUC, RMSE, ship. But evaluation is fundamentally a product question — is this model good enough for the specific job it needs to do?
A fraud model tuned for 95% recall might be excellent for a payment processor and catastrophic for a social platform, where the extra false positives that come with aggressive recall become false accusations that damage user trust. The metric doesn't change. The context does. PMs own that context.
PM Insight
Your job in model evaluation is to define the acceptance criteria before the model is built — not to react to whatever numbers the team reports afterward. "What level of performance would make us confident enough to ship?" is a product decision. Make it explicitly.
Offline vs online evaluation
There are two phases of model evaluation, and conflating them is a common source of painful surprises.
Offline evaluation
Evaluation on a held-out dataset before deployment. Fast, cheap, reproducible. This is where AUC, precision, recall, RMSE live. The limitation: your held-out data is a snapshot of the past. It doesn't tell you how the model behaves with real users making real decisions in a live product.
Online evaluation
Evaluation in production — usually via A/B test comparing model vs baseline on the business metrics you actually care about. Slower and costlier, but it's the only thing that proves the model moves the needle where it matters. A model that looks great offline sometimes fails online because the offline dataset didn't capture the full distribution of real-world inputs.
The evaluation hierarchy
Offline metrics → does the model work on historical data?
Shadow mode → run the new model in production alongside the existing one, logging its predictions without acting on them — so you can compare real-world behaviour without exposing users to failures
Online A/B test → does the model improve the business metric?
Post-launch monitoring → does it keep working over time?
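Shadow mode can be sketched as a thin wrapper around the serving path. This is a minimal illustration, not a production pattern; the model objects and `predict()` interface are hypothetical stand-ins.

```python
# Shadow mode: the live model drives the product, the candidate's
# predictions are only logged for offline comparison.
# Model objects and the predict() interface are hypothetical stand-ins.
import json
import time

def handle_request(features, live_model, shadow_model, log):
    live_pred = live_model.predict(features)      # what users see
    shadow_pred = shadow_model.predict(features)  # logged, never acted on
    log.append(json.dumps({"ts": time.time(),
                           "live": live_pred, "shadow": shadow_pred}))
    return live_pred

class StubModel:
    def __init__(self, bias):
        self.bias = bias
    def predict(self, x):
        return x + self.bias

log = []
result = handle_request(10, StubModel(0), StubModel(1), log)
# result is the live prediction; the shadow prediction exists
# only in the log, so a shadow-model failure never reaches users.
```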
Pick one metric — and treat the rest as constraints
Most models can be evaluated on multiple dimensions: precision, recall, latency, cost per inference, false positive rate. Tracking all of them simultaneously makes decisions slow and arguments endless. The most effective teams agree on one metric to optimise — and treat everything else as a constraint to satisfy.
Optimising vs satisficing metrics
An optimising metric is the one number you're trying to maximise (or minimise). For a churn model it might be F1. For a recommendation system, click-through rate. You want this as high as possible.
Satisficing metrics are constraints: thresholds you need to meet, but don't need to exceed. Inference latency under 150ms. False positive rate below 3%. Hallucination rate below 1 in 500 responses. Once the threshold is met, more is not better.
Example: your content moderation model optimises for recall (catch as much harmful content as possible) subject to two constraints: precision ≥ 85% (keeps the false accusation rate acceptable) and latency ≤ 200ms (doesn't block post submission). One number to improve. Two constraints not to break.
The reason this matters for PMs: when the team presents a model improvement, it should move the optimising metric without violating any satisficing constraint. If it improves recall but blows the latency budget, it's not a ship. Defining this framework before development starts makes those conversations crisp.
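The ship decision described above reduces to a small, mechanical check. A minimal sketch, assuming recall as the optimising metric; all metric names, baseline numbers and thresholds are illustrative.

```python
# "One optimising metric, everything else a constraint" ship check.
# Metric names and thresholds here are illustrative, not prescriptive.

def should_ship(candidate, baseline, constraints):
    """Ship only if the optimising metric improves and no constraint breaks."""
    if candidate["recall"] <= baseline["recall"]:
        return False, "optimising metric (recall) did not improve"
    for name, (op, limit) in constraints.items():
        ok = candidate[name] >= limit if op == ">=" else candidate[name] <= limit
        if not ok:
            return False, f"constraint violated: {name} {op} {limit}"
    return True, "ship"

constraints = {"precision": (">=", 0.85), "latency_ms": ("<=", 200)}
baseline = {"recall": 0.78}
candidate = {"recall": 0.83, "precision": 0.88, "latency_ms": 240}

ok, reason = should_ship(candidate, baseline, constraints)
# Better recall, but the latency budget is blown — not a ship.
```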
PM Insight
If your team can't agree on one optimising metric, that's usually a sign the product goal isn't clear yet — not a modelling problem. Resolve the product question first. "We're maximising for X, subject to Y and Z" is a sentence worth writing down before a single model is trained.
Regression metrics: when the output is a number
Classification has precision/recall. Regression — predicting a continuous value like revenue, demand, or delivery time — has its own evaluation vocabulary.
MAE — Mean Absolute Error
Average of |predicted − actual|. Easy to interpret: "we're off by X units on average." Less sensitive to large errors than RMSE.
RMSE — Root Mean Squared Error
Penalises large errors more heavily than MAE. Use when big mistakes are especially costly. Same units as the target variable.
R² — Coefficient of Determination
Fraction of variance explained by the model. R² of 0.85 means the model explains 85% of the variation. Beware: can be misleadingly high with many features.
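The three metrics above can be computed from scratch on a toy daily-demand forecast (numbers illustrative):

```python
# MAE, RMSE and R² from first principles on a toy demand forecast.
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual    = [100, 120,  90, 110, 130]   # units sold per day
predicted = [ 95, 125, 100, 105, 120]

print(mae(actual, predicted))                 # 7.0 — off by 7 units/day on average
print(round(rmse(actual, predicted), 2))      # 7.42 — large misses weighted more
print(round(r_squared(actual, predicted), 3)) # 0.725
```

Note how RMSE exceeds MAE here: the two 10-unit misses pull it up more than the 5-unit ones, which is exactly the "penalise big mistakes" behaviour described above.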
PM Insight
For demand forecasting, ask: "What does a 10% error cost us?" MAE of 50 units sounds abstract. "We over-order by 50 units per day at $12/unit carrying cost" is a business problem. Always translate regression error into operational impact.
The business cost of errors
Statistical metrics weight all errors equally. Business doesn't. A false negative in cancer screening is life-or-death; in a content recommendation it's a missed click. The metric is the same; the cost is completely different.
The right way to set a classification threshold isn't to maximise F1 — it's to minimise total business cost. That means putting a real number on each error type.
Finding the cost-minimising threshold
Assign a cost to each error type for your use case, compute total business cost at every candidate threshold, and choose the threshold where that total is lowest.
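The threshold search described above can be sketched in a few lines. All scores, labels and dollar costs here are illustrative.

```python
# Cost-minimising threshold search: put a dollar cost on each error
# type, then sweep candidate thresholds. All numbers are illustrative.

def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn

def best_threshold(scores, labels, cost_fp, cost_fn):
    candidates = [i / 100 for i in range(101)]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))

scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]   # model scores
labels = [0,   0,   1,    1,   1,    0,   1,   0]      # 1 = actual fraud

# A missed fraud case ($500) costs far more than a false alarm ($20),
# so the optimum sits low enough to catch every positive in this sample.
t = best_threshold(scores, labels, cost_fp=20, cost_fn=500)
```

Flipping the cost ratio (expensive false positives, cheap false negatives) pushes the optimal threshold up — the same model, a different business, a different operating point.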
Error analysis: find out where to improve before you start improving
When a model underperforms, the instinct is to immediately try fixes: collect more data, tune hyperparameters, try a different architecture. Most of the time that energy is wasted — because the team hasn't established where the errors are actually coming from.
Error analysis is the practice of manually examining a sample of the model's mistakes, categorising them, and tallying which categories dominate. It takes a few hours and routinely changes the direction of weeks of engineering work.
How to run an error analysis
Take 100 examples the model got wrong from your dev set. For each one, note why it was wrong. Group the errors into categories. Count them.
Example — a content moderation model has 20% error rate. You sample 100 mistakes:
- 42 are memes (image-only content the model can't parse)
- 31 are non-English posts
- 18 are sarcasm or irony
- 9 are other
Even if you perfectly solved the sarcasm problem, you'd only fix 18% of your errors. Memes and non-English content are the higher-leverage bets — and that's a product and data decision before it's a modelling decision.
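The tally itself is a few lines of bookkeeping. A sketch using the same counts as the moderation example above:

```python
# Tallying categorised mistakes from a manual error review.
from collections import Counter

# Categories assigned during review of 100 sampled mistakes
# (same tallies as the moderation example above).
reviewed = ["meme"] * 42 + ["non_english"] * 31 + ["sarcasm"] * 18 + ["other"] * 9

tally = Counter(reviewed)
total = sum(tally.values())
for category, count in tally.most_common():
    print(f"{category:12s} {count:3d}  ({count / total:.0%} of errors)")
# Fixing sarcasm perfectly removes at most 18% of errors;
# memes alone account for 42%.
```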
PMs can run error analysis directly. You don't need to understand the model internals — you need to be able to look at an output, understand why it was wrong, and assign it a category. The categorisation itself is a product judgment: which failure types are most harmful? Which are easiest to fix? Which are strategically important even if numerically small?
PM Insight
Before your team commits to a sprint of model improvements, ask: "Can we spend half a day doing error analysis first?" A tally of 100 mistakes — categorised by type — is often more valuable than a week of model tuning in the wrong direction. It also makes the improvement roadmap defensible: you're not guessing what to fix, you're counting.
Calibration: do the scores mean what they say?
A model that outputs a 0.8 churn probability — does that mean there's an 80% chance of churn? Not necessarily. Many models are poorly calibrated: their scores rank users correctly (high scores = higher risk), but the scores don't match real-world frequencies.
Calibration matters when you're using the score as a probability rather than just a ranking. If your intervention team uses a 0.7 threshold expecting to reach users with 70%+ churn probability, a miscalibrated model could mean those users actually have a 40% churn rate.
Calibration check
Group predictions into buckets (0–0.1, 0.1–0.2, … 0.9–1.0). For each bucket, check the actual positive rate. If the model is well-calibrated, predictions of 0.3 should correspond to ~30% actual positive rate. A calibration plot makes this immediately visible.
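The bucketed check above is simple to implement. A sketch on synthetic data (a deliberately well-calibrated toy example):

```python
# Bucketed calibration check: compare predicted probability to the
# actual positive rate in each bucket. Data here is synthetic.

def calibration_table(probs, labels, n_buckets=10):
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 goes in the top bucket
        buckets[i].append(y)
    table = []
    for i, ys in enumerate(buckets):
        if ys:  # skip empty buckets
            lo, hi = i / n_buckets, (i + 1) / n_buckets
            table.append((lo, hi, sum(ys) / len(ys), len(ys)))
    return table

# Well-calibrated toy data: 3 of 10 "0.3" predictions are positive,
# 7 of 10 "0.7" predictions are positive.
probs  = [0.3] * 10 + [0.7] * 10
labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0] + [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

for lo, hi, rate, n in calibration_table(probs, labels):
    print(f"{lo:.1f}-{hi:.1f}: actual positive rate {rate:.2f} (n={n})")
```

On a miscalibrated model, the actual rate in a bucket diverges visibly from the bucket's predicted range — exactly the 0.7-threshold-but-40%-churn failure described above.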
Model degradation: the slow failure mode
A model trained six months ago was trained on six-month-old data. The world — and your users — change. This is called distribution shift: the statistical patterns in real-world data gradually diverge from what the model learned, and its performance quietly erodes. Without monitoring, you won't notice until the damage is done.
Common causes of degradation:
- Concept drift — the relationship between inputs and outcomes changes. A fraud model trained before a major shift in payment behaviour — a new platform, a competitor's collapse, a regulatory change — may not recognise the new patterns that emerge.
- Data drift — the characteristics of incoming data change. New user cohorts behave differently; seasonal patterns shift; a new platform drives different device types.
- Pipeline drift — upstream data sources change their format, coverage, or timing without the model team being notified. The model gets subtly different inputs than it was trained on.
- Feedback loop degradation — the model's own predictions change user behaviour, which then changes the training data for the next version. The model starts chasing its own tail.
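Data drift in particular can be watched with simple statistics. One common approach is the Population Stability Index (PSI) between a training-time reference sample and recent production data — a sketch below, where the bucket edges and the 0.2 alert threshold are widely used rules of thumb, not universal standards.

```python
# Minimal data-drift check: PSI between a training-time reference
# sample and recent production data. Bucket edges and the 0.2 alert
# threshold are common rules of thumb, not universal standards.
import math

def psi(reference, recent, edges):
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            i = sum(v > e for e in edges)  # which bucket v falls in
            counts[i] += 1
        return [max(c / len(values), 1e-4) for c in counts]  # avoid log(0)
    ref, cur = shares(reference), shares(recent)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

edges = [10, 20, 30, 40]  # feature bucket boundaries (illustrative)
reference      = [5, 12, 15, 22, 25, 28, 33, 38, 41, 45]
recent_ok      = [6, 11, 16, 21, 26, 29, 34, 37, 42, 44]   # similar shape
recent_shifted = [31, 33, 35, 36, 38, 39, 41, 43, 44, 46]  # distribution moved up

stable_score  = psi(reference, recent_ok, edges)       # ~0: no drift
shifted_score = psi(reference, recent_shifted, edges)  # well above 0.2: investigate
```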
PM Insight
Before launch, get agreement on: what's the monitoring plan? What metric triggers a retraining or rollback? Who owns model health? "We'll check in if something looks wrong" is not a plan. Define thresholds. Define owners. Define cadence.
Evaluation for generative AI: a different problem
Everything above applies to predictive ML — classifiers and regression models with well-defined outputs. Generative AI (LLMs, image models) is harder to evaluate because there's no single right answer.
Common approaches:
- Human evaluation — raters score outputs on relevance, quality, safety. Gold standard but slow and expensive.
- LLM-as-judge — use a capable LLM to evaluate outputs from another model. Fast and scalable but inherits the judge model's biases.
- Automated metrics — BLEU and ROUGE measure word-level overlap between a generated response and a reference answer; BERTScore compares meaning rather than exact words. Useful for catching regressions between model versions, but all correlate poorly with what humans actually consider high quality.
- Task-specific evals — define concrete tasks (answer these 200 questions correctly, follow these instructions, refuse these harmful prompts) and measure pass rate.
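A task-specific eval is essentially a fixed test suite with a pass rate. A minimal sketch — the "model" here is a canned stand-in, not a real LLM call, and the cases are illustrative:

```python
# Minimal task-specific eval harness: run the model over fixed prompts,
# each with its own checker, and report the pass rate.
# toy_model is a canned stand-in for illustration, not a real LLM call.

def run_eval(model, cases):
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def toy_model(prompt):
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
        "How do I pick a lock?": "I can't help with that.",
    }
    return canned.get(prompt, "")

cases = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Capital of France?", lambda out: "paris" in out.lower()),
    # Refusal check: the harmful prompt must be declined
    ("How do I pick a lock?", lambda out: "can't help" in out.lower()),
]

rate = run_eval(toy_model, cases)  # 1.0 — all three cases pass
```

The pass rate becomes a regression gate: rerun the same suite on every new model version and compare.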
PM Playbook — Questions to ask
- What are the acceptance criteria — and were they set before the model was built?
- What is the one optimising metric? — if the team can't name it, the product goal isn't clear yet
- What are the satisficing constraints? — latency, false positive rate, cost per inference; any improvement that breaks a constraint is not a ship
- Have we done error analysis? — before committing to a model improvement sprint, spend half a day categorising ~100 mistakes
- What's the offline metric, and what's the expected online impact? — both matter; neither alone is sufficient
- Is the model calibrated? — especially if the probability score is being used for decisions, not just ranking
- What does an error cost in business terms? — translate MAE or false positives into dollars, user experience, or operational load
- What's the monitoring plan post-launch? — who owns it, what triggers action, how often is performance reviewed?
- What's the retraining cadence? — and who decides when to retrain vs roll back vs shut down?