Chapter 07
Supervised Learning
Classification vs regression, how decision trees think, and why "accuracy" is often the wrong metric to care about.
What supervised learning is
Supervised learning is the most common type of ML in production. You give a model labeled examples — inputs paired with known outputs — and it learns to predict the output for inputs it hasn't seen before.
"Supervised" refers to the labels: someone (or some system) had to provide the answers during training. A spam filter trained on emails marked "spam" or "not spam." A churn model trained on users who did or didn't cancel. A price estimator trained on historical sales data. All supervised.
Every supervised learning problem is either a classification problem or a regression problem. This distinction matters because it shapes what your model outputs, how you evaluate it, and what "good" looks like.
Classification vs regression
| Type | Output | Product examples | Evaluation |
|---|---|---|---|
| Classification | A category (spam/not spam, churn/stay, A/B/C) | Fraud detection, content moderation, churn prediction, lead scoring | Precision, recall, F1, AUC |
| Regression | A continuous number | Revenue forecasting, demand prediction, estimated delivery time, pricing | RMSE, MAE, R² |
Many real problems blur the line. Churn prediction is framed as classification, but classifiers typically output a continuous probability score, which is then thresholded into "high risk" / "low risk" decisions. Knowing which type you're dealing with determines how you evaluate success.
PM Insight
When your team says "the model is 87% accurate," the first question is: accurate at what? Classification accuracy and regression accuracy are completely different things. And as you're about to see, classification accuracy is often the least useful metric to optimize.
How decision trees think
Decision trees are one of the most intuitive ML models — and understanding them gives you a mental model that applies to more complex algorithms too.
A decision tree makes predictions by asking a sequence of yes/no questions about the input features. Each question splits the data further until the model arrives at a prediction.
Example — Churn prediction tree
Days since last login > 14?
├─ YES → Feature adoption < 3?
│ ├─ YES → HIGH CHURN RISK
│ └─ NO → MEDIUM RISK
└─ NO → Plan = Free?
├─ YES → MEDIUM RISK
└─ NO → LOW RISK
Decision trees are highly interpretable — you can trace exactly why a prediction was made. This makes them popular in regulated industries (credit, healthcare) where explainability is required. Their weakness is that deep trees overfit easily.
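The example tree above translates directly into nested conditionals. A minimal sketch — the feature names and thresholds come from the diagram, not from any real trained model:

```python
def churn_risk(days_since_login: int, features_adopted: int, plan: str) -> str:
    """Mirror of the example churn tree: each branch is one yes/no question."""
    if days_since_login > 14:
        if features_adopted < 3:
            return "HIGH"
        return "MEDIUM"
    if plan == "Free":
        return "MEDIUM"
    return "LOW"

# Tracing a prediction is just reading off the branch taken:
churn_risk(days_since_login=21, features_adopted=1, plan="Pro")   # "HIGH"
churn_risk(days_since_login=3, features_adopted=5, plan="Free")   # "MEDIUM"
```

This is the whole appeal of a single tree: the model *is* its explanation.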
In practice, teams often use random forests (many trees averaged) or gradient boosting (trees built sequentially to fix each other's errors). These are less interpretable but significantly more accurate.
PM Insight
Ask your team: does this use case require explainability? If a user is denied a loan or flagged for review, they may be legally entitled to a reason. A random forest can't give you one. A single decision tree can. Accuracy vs interpretability is a product decision, not just a technical one.
Common modeling methods
Decision trees are a useful mental model, but in practice your team will reach for a small set of standard algorithms depending on the problem. Here's what each one is, when it's used, and what you as a PM need to know about the tradeoffs.
| Model | Problem type | Interpretable? | Typical accuracy | When to push for it |
|---|---|---|---|---|
| Logistic regression | Classification | Yes — coefficients | Modest baseline | Regulated use cases, need to explain decisions, quick baseline |
| Linear regression | Regression | Yes — coefficients | Modest baseline | Revenue or demand forecasting, interpretable trends |
| Random forest | Both | Partially (feature importance) | Good | Better accuracy than a single tree; robust, few hyperparameters to tune |
| Gradient boosting (XGBoost, LightGBM) | Both | No | Excellent on tabular data | Production accuracy ceiling on structured data; the default choice in industry |
| Neural networks | Both | No (black box) | State of the art on images, text, audio | Unstructured data — images, language, time series. Covered in Ch 10. |
Logistic regression
Despite the name, logistic regression is a classification model. It calculates a weighted sum of the input features, then squashes the result into a probability between 0 and 1. Each feature gets a coefficient showing its direction and weight: "every extra day since last login increases churn probability by X." This makes it fully auditable and the right choice when you need to explain individual decisions to regulators or users.
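The whole mechanism fits in a few lines. A sketch with hypothetical coefficients (the intercept and weights below are invented for illustration, not learned from data):

```python
import math

def churn_probability(days_since_login: float, features_adopted: float) -> float:
    # Hypothetical learned coefficients -- illustrative only.
    intercept = -2.0
    w_days = 0.15        # more days inactive  -> higher churn risk
    w_features = -0.4    # more features adopted -> lower churn risk
    score = intercept + w_days * days_since_login + w_features * features_adopted
    return 1 / (1 + math.exp(-score))   # sigmoid squashes the sum into (0, 1)

churn_probability(days_since_login=20, features_adopted=1)   # ~0.65
```

Because each coefficient is visible, the model's reasoning is fully auditable: the sign tells you the direction of the effect, the magnitude tells you its weight.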
Linear regression
The regression counterpart — outputs a continuous number instead of a probability. Fits a line (or hyperplane) through the training data; each feature has a coefficient. "Each additional $1 in average order value predicts $0.14 more revenue per user per month." Interpretable, fast, and a useful baseline before reaching for more complex models.
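For a single feature, the ordinary-least-squares fit has a closed form. A sketch with made-up data chosen to echo the $0.14 figure above:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical data: average order value ($) -> monthly revenue per user ($)
aov = [10, 20, 30, 40]
rev = [2.0, 3.4, 4.8, 6.2]
slope, intercept = fit_line(aov, rev)
# slope = 0.14: each extra $1 of AOV predicts $0.14 more monthly revenue
```

The learned slope is the coefficient you'd read out of the model — which is exactly why linear regression is the interpretable baseline for forecasting.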
Random forests
A random forest builds hundreds of decision trees — each on a random subset of the data and a random subset of features — then averages their predictions. The individual trees make different errors that cancel out in aggregate, which dramatically reduces overfitting compared to a single tree. You lose the "trace the path" interpretability, but you gain feature importance scores: which features the forest relied on most overall. Robust and hard to break with bad settings.
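The averaging step itself is trivial. A toy sketch in which three hypothetical "trees" (here just hand-written rules standing in for trees grown on different bootstrap samples) vote, and the forest's score is their mean:

```python
def forest_predict(trees, x):
    """A random forest's score is the average of its trees' votes."""
    votes = [tree(x) for tree in trees]
    return sum(votes) / len(votes)

# Three hypothetical trees -- each picked up slightly different rules
# because each saw a different random slice of the data and features.
trees = [
    lambda x: 1 if x["days_inactive"] > 10 else 0,
    lambda x: 1 if x["days_inactive"] > 14 or x["tickets"] > 2 else 0,
    lambda x: 1 if x["features_adopted"] < 3 else 0,
]

user = {"days_inactive": 12, "tickets": 0, "features_adopted": 5}
forest_predict(trees, user)   # 1/3 of the trees vote churn -> score ~0.33
```

No single tree's quirk dominates: an individual tree's over-eager rule gets outvoted, which is the variance-reduction intuition in miniature.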
Gradient boosting (XGBoost, LightGBM)
Where random forests build trees in parallel and average them, gradient boosting builds trees sequentially — each new tree focuses on correcting the errors of the previous one. The result is the highest-accuracy option for tabular data in industry. XGBoost and LightGBM are the two dominant libraries; LightGBM is faster and preferred for large datasets. The tradeoff: more hyperparameters to tune, and no clean per-prediction explanation without tools like SHAP.
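The sequential-correction idea can be shown without real trees. In this sketch the "weak learner" is reduced to a constant (the mean of the current residuals) — a deliberate simplification standing in for the shallow trees XGBoost or LightGBM would actually fit:

```python
def boost(ys, n_rounds=5, learning_rate=0.5):
    """Gradient boosting for squared error, with a constant weak learner."""
    predictions = [0.0] * len(ys)
    losses = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        correction = sum(residuals) / len(residuals)       # weak learner's fit
        predictions = [p + learning_rate * correction for p in predictions]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return predictions, losses

preds, losses = boost([3.0, 5.0, 7.0])
# losses falls every round: each learner corrects what the previous ones left.
```

With real trees as the weak learners, each round can correct different errors for different rows — which is where the accuracy advantage on tabular data comes from.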
PM Insight
If your team is spending weeks tuning a neural network on tabular data (user features, transaction records, product attributes), ask whether gradient boosting has been tried. It typically outperforms neural networks on structured data and trains in minutes, not hours. The rough rule: logistic regression as the interpretable baseline, gradient boosting as the accuracy ceiling, random forest as a robust middle ground.
Overfitting: when more complexity hurts
Every model has a complexity dial — tree depth, number of trees, number of layers. At low complexity, the model is too simple to capture real patterns in the data (underfitting). At high complexity, it memorises the training data so thoroughly that it fails on new data it's never seen (overfitting). The goal is the sweet spot in between.
The hallmark of overfitting: training error keeps falling as complexity increases, but test error stops improving and starts rising. A model that's 99% accurate on training data and 67% accurate in production is not a good model — it's a model that memorised the exam rather than learning the material.
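The limiting case of overfitting is a model that is literally a lookup table. A toy illustration of why perfect training accuracy proves nothing (data invented for the example):

```python
# Toy data: (days_inactive, churned). The real pattern: inactivity -> churn.
train = [(2, 0), (5, 0), (20, 1), (30, 1), (7, 1)]   # (7, 1) is a noisy row
test  = [(3, 0), (6, 0), (25, 1), (8, 0)]

# "Overfit" model: memorise training answers, guess 0 for anything unseen.
memorised = dict(train)
overfit = lambda days: memorised.get(days, 0)

# Simple model: one threshold -- it learned the pattern, not the rows.
simple = lambda days: 1 if days > 10 else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

accuracy(overfit, train)   # 1.0 -- perfect on what it memorised
accuracy(overfit, test)    # worse than the simple model on unseen users
```

The memoriser even "learned" the noisy row perfectly — which is precisely the behaviour that falls apart in production.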
Interactive — Overfitting Visualizer
Green line = training error | Red dashed = test error | Drag the slider to increase model complexity.
Why accuracy is the wrong metric
Imagine you're building a fraud detection model. 99% of transactions are legitimate. A model that predicts "not fraud" for every single transaction achieves 99% accuracy — and catches zero fraud. Congratulations on a useless model.
This is the class imbalance problem, and it's extremely common in product ML: churn rates of 3–5%, fraud rates under 1%, click-through rates of 0.1%. When the thing you care about is rare, accuracy is meaningless.
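The fraud example takes four lines to verify:

```python
# 1,000 transactions, 10 fraudulent (1% fraud rate)
labels = [1] * 10 + [0] * 990

# A "model" that always predicts not-fraud:
predictions = [0] * 1000

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)  # 0.99
recall = sum(p == 1 and y == 1
             for p, y in zip(predictions, labels)) / sum(labels)           # 0.0
```

99% accurate, zero fraud caught. Any metric worth reporting on imbalanced data has to look at the rare class specifically.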
Instead, you need metrics that measure the specific types of errors that matter for your use case.
The four outcomes of any classification
For every prediction a classifier makes, one of four things happens. The model predicts churn (positive) or no-churn (negative), and the user either churns or doesn't:
- True positive — predicted churn, user churned. A correct flag.
- False positive — predicted churn, user stayed. A wasted intervention.
- False negative — predicted stay, user churned. A missed churner.
- True negative — predicted stay, user stayed. Correctly left alone.
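Counting the four outcomes gives you the confusion matrix. A minimal sketch on invented labels:

```python
def confusion(y_true, y_pred):
    """Count the four outcomes of a binary classifier: (TP, FP, FN, TN)."""
    tp = sum(p == 1 and y == 1 for y, p in zip(y_true, y_pred))
    fp = sum(p == 1 and y == 0 for y, p in zip(y_true, y_pred))
    fn = sum(p == 0 and y == 1 for y, p in zip(y_true, y_pred))
    tn = sum(p == 0 and y == 0 for y, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]   # 1 = user actually churned
y_pred = [1, 0, 1, 0, 1, 0]   # model's calls
confusion(y_true, y_pred)     # (2, 1, 1, 2)
```

Every classification metric in this chapter — precision, recall, F1, the points on a ROC curve — is computed from these four counts.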
Precision, recall, and the threshold tradeoff
Most classifiers don't output a hard yes/no — they output a probability score. "This user has a 0.73 probability of churning." A threshold converts that score into a decision: above the threshold → predict churn; below → predict stay.
Where you set that threshold is a product decision. Lower it and you flag more users as churn risk — catching more real churners (higher recall) but also wasting intervention budget on users who would have stayed (lower precision). Raise it and you're selective but miss real churners.
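The tradeoff is easy to see numerically. A sketch on a hypothetical set of model scores (invented for illustration):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of 'flag everyone at or above the threshold'."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    flagged, churners = sum(preds), sum(labels)
    return (tp / flagged if flagged else 0.0), tp / churners

# Hypothetical churn scores for eight users (1 = actually churned)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   1,   0,   0  ]

precision_recall(scores, labels, threshold=0.75)  # strict: precision 1.0, recall 0.5
precision_recall(scores, labels, threshold=0.35)  # loose:  precision ~0.67, recall 1.0
```

Same model, same scores — moving the threshold alone trades precision for recall.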
The visualizer below shows a synthetic churn model. Drag the threshold and watch how the confusion matrix and all four metrics shift.
Interactive — Classification Threshold Explorer
Red bars = churners | Blue bars = users who stayed | Orange line = your threshold
Precision and recall: a plain-language guide
Precision — "When we flag someone, how often are we right?"
Of all users predicted to churn, what fraction actually churned? Low precision = lots of wasted interventions. If your team calls 100 "at-risk" users and 80 were never going to churn, you've wasted 80 calls and possibly annoyed loyal customers.
Recall — "Of all the churners, how many did we catch?"
Of all users who actually churned, what fraction did the model flag? Low recall = churners slipping through. If 60% of churners are never flagged, your intervention program is only reaching 40% of the people who need it.
F1 Score — "The harmonic mean of precision and recall"
The harmonic mean penalises whichever of precision or recall is lower — a model with precision 0.9 and recall 0.1 has an F1 of 0.18, not 0.5. This makes F1 more honest than a simple average when one dimension is weak. When one error type matters significantly more than the other, use a weighted variant instead: F2 emphasises recall; F0.5 emphasises precision.
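The 0.18-vs-0.5 contrast from the paragraph above, computed directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1(0.9, 0.1)       # ~0.18 -- the weak dimension drags the score down
(0.9 + 0.1) / 2    # 0.5  -- the simple average would hide the weakness
```

A model can only score a high F1 by being decent on both dimensions at once.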
Precision vs recall: which matters more?
It depends on the cost of each error type — which is a business decision, not a statistics one.
- Optimize for precision when false positives are expensive or harmful. Wrongly flagging someone for fraud means a blocked card, a frustrated customer, a support call. Better to miss some fraud than to alienate good customers.
- Optimize for recall when false negatives are expensive. Missed cancer diagnoses, missed security breaches, missed churn before a contract renewal — here, catching everything matters more than occasional false alarms.
PM Insight
Your team shouldn't decide the precision/recall tradeoff — you should. It's a business decision about the relative cost of different errors. "What's worse: flagging a happy user as at-risk, or missing a churner?" The answer sets where you put the threshold.
AUC: evaluating the model itself, not the threshold
Precision and recall depend on your threshold choice. But how do you evaluate whether the model's underlying scores are any good — before deciding on a threshold?
AUC stands for Area Under the ROC Curve. The ROC curve (Receiver Operating Characteristic — a name inherited from radar engineering) plots the model's true positive rate against its false positive rate at every possible threshold from 0 to 1. AUC is simply the area under that curve — a single number that captures how well the model separates the two classes across all thresholds. An AUC of 1.0 is perfect; 0.5 means the model is no better than random guessing.
- AUC > 0.9 — excellent discrimination. The model separates classes very well across all thresholds.
- AUC 0.75–0.9 — good. Useful in most product contexts.
- AUC 0.6–0.75 — modest. Better than random, but threshold matters a lot.
- AUC < 0.6 — weak signal. Be very cautious about acting on model outputs.
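AUC has an equivalent, very concrete reading: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count half). A sketch using the same kind of synthetic scores as the threshold example:

```python
def auc(scores, labels):
    """AUC as a ranking probability: P(random positive outscores random
    negative), with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   1,   0,   0  ]
auc(scores, labels)   # ~0.81 -- good separation, not perfect
```

Note that no threshold appears anywhere in the computation — which is exactly the point: AUC evaluates the scores, not the decision rule.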
PM Insight
AUC tells you whether the model can rank users correctly — higher-risk users getting higher scores than lower-risk ones. It doesn't tell you whether the scores are calibrated (i.e. whether a score of 0.7 actually means 70% probability). For ranking use cases (recommender systems, search), AUC is the right headline. For intervention use cases (churn, fraud), also check calibration.
Interactive — ROC Curve
The full ROC curve for the same churn model. Each point on the curve is one threshold setting. Drag the slider to move the orange dot and see how TPR and FPR shift. The shaded region is the AUC.
Explainability: why did the model predict that?
A model predicts that a user has an 84% probability of churning. Your customer success team asks: why? A user is denied a loan. Regulations say they're entitled to a reason. A content moderation decision is appealed. Support needs to explain it.
The answer depends entirely on which type of model you're using — and whether you've planned for explainability from the start.
Interpretable by design
Some models are transparent: you can read their logic directly.
- Logistic regression — each feature has a coefficient that shows its direction and weight. "Every additional day since last login increases churn probability by X." Fully auditable.
- Decision trees — predictions follow a traceable path of rules (as in the example above). You can show a user the exact sequence of conditions that led to their outcome.
The tradeoff: these models are often less accurate than more complex alternatives. In many use cases, that's an acceptable price. In regulated contexts, it may be the only acceptable option.
Post-hoc explanation for black-box models
When you need accuracy and explainability, post-hoc explanation methods generate a human-readable explanation of a prediction after the model has made it — without changing the model itself.
Feature importance vs per-prediction explanation
Feature importance is a global view: across the whole dataset, which features did the model rely on most? Useful for understanding the model's general behaviour, catching bias, and checking that the model is using sensible signals. It does not tell you why any single prediction was made.
SHAP (SHapley Additive exPlanations) gives a per-prediction breakdown: for this specific user, how much did each feature push the score up or down from the baseline?
User #4821 — churn probability: 0.82
↑ +0.31 — no login in 18 days
↑ +0.22 — zero mobile app sessions this month
↑ +0.18 — support ticket opened this week
↓ −0.09 — on annual plan (stabilising factor)
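The "Additive" in SHAP's name is literal: the baseline score plus the per-feature contributions equals the prediction. A check using the hypothetical user #4821 breakdown above — the 0.20 baseline (the model's average score) is an assumed value for illustration:

```python
# SHAP additivity: baseline + per-feature contributions = the final score.
baseline = 0.20   # assumed average model output, for illustration only
contributions = {
    "no login in 18 days":       +0.31,
    "zero mobile app sessions":  +0.22,
    "support ticket this week":  +0.18,
    "on annual plan":            -0.09,
}
score = baseline + sum(contributions.values())
round(score, 2)   # 0.82 -- matches the predicted churn probability
```

That additivity is what makes the breakdown trustworthy as an accounting of the score, even though the attribution itself is an approximation.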
LIME (Local Interpretable Model-agnostic Explanations) works similarly — it fits a simple interpretable model around a single prediction to approximate the complex model's local behaviour.
The honest limitation
Post-hoc explanations are approximations. SHAP and LIME describe what a model appears to be doing for a given prediction — they don't give you a mathematical proof of causality. For complex models with thousands of interacting features, the explanation is a useful approximation, not a complete account. Good enough for most product decisions; not sufficient for high-stakes legal contexts without additional validation.
When explainability is a legal requirement
In some contexts, the ability to explain a decision isn't optional:
- Consumer credit and lending — regulations in the US (ECOA) and EU require that applicants denied credit receive a specific reason. "The model said so" is not compliant.
- GDPR Article 22 — individuals subject to fully automated decisions with significant effects have the right to an explanation and the right to human review.
- EU AI Act — high-risk AI systems (hiring, education, credit, law enforcement) require transparency and documentation of model logic.
- Hiring and HR — using ML in recruitment decisions is heavily scrutinised in most jurisdictions; unexplainable rejection reasons create legal exposure.
PM Insight
Explainability requirements should be established at kickoff — not retrofitted after a model is built. "Does this use case involve consequential decisions about individuals?" If the answer is yes, constrain the model choice up front, or budget for SHAP integration and human review workflows before you commit to a launch date. Discovering you need explainability after you've shipped a neural network is expensive.
PM Playbook — Questions to ask
- Is this classification or regression? — sets the right evaluation framework
- What's the class balance? — if the target is rare, accuracy is meaningless
- What's the cost of a false positive vs a false negative? — you define this, not the data team
- What threshold are we using and why? — it should be driven by business costs, not default 0.5
- Show me the confusion matrix, not just accuracy — always
- What's the AUC on the held-out test set? — and is that test set truly representative?
- Does this use case involve consequential decisions about individuals? — if yes, establish explainability requirements at kickoff: constrain model choice, plan for SHAP integration, and define the human review workflow
- Are there regulatory requirements for explanation? — credit, hiring, healthcare, and GDPR-covered automated decisions all have specific obligations