Chapter 07
Supervised Learning
Classification vs regression, how decision trees think, and why "accuracy" is often the wrong metric to care about.
What supervised learning is
Supervised learning is the most common type of ML in production. You give a model labeled examples — inputs paired with known outputs — and it learns to predict the output for inputs it hasn't seen before.
"Supervised" refers to the labels: someone (or some system) had to provide the answers during training. A spam filter trained on emails marked "spam" or "not spam." A churn model trained on users who did or didn't cancel. A price estimator trained on historical sales data. All supervised.
Every supervised learning problem is either a classification problem or a regression problem. This distinction matters because it shapes what your model outputs, how you evaluate it, and what "good" looks like.
Classification vs regression
| Type | Output | Product examples | Evaluation |
|---|---|---|---|
| Classification | A category (spam/not spam, churn/stay, A/B/C) | Fraud detection, content moderation, churn prediction, lead scoring | Precision, recall, F1, AUC |
| Regression | A continuous number | Revenue forecasting, demand prediction, estimated delivery time, pricing | RMSE, MAE, R² |
Many real problems blur the line. Churn prediction is framed as classification, but classifiers typically output a continuous probability score, which is then thresholded into "high risk" / "low risk" decisions. Knowing which type you're dealing with determines how you evaluate success.
PM Insight
When your team says "the model is 87% accurate," the first question is: accurate at what? Classification accuracy and regression accuracy are completely different things. And as you're about to see, classification accuracy is often the least useful metric to optimize.
How decision trees think
Decision trees are one of the most intuitive ML models — and understanding them gives you a mental model that applies to more complex algorithms too.
A decision tree makes predictions by asking a sequence of yes/no questions about the input features. Each question splits the data further until the model arrives at a prediction.
Example — Churn prediction tree
Days since last login > 14?
├─ YES → Feature adoption < 3?
│ ├─ YES → HIGH CHURN RISK
│ └─ NO → MEDIUM RISK
└─ NO → Plan = Free?
├─ YES → MEDIUM RISK
└─ NO → LOW RISK
Decision trees are highly interpretable — you can trace exactly why a prediction was made. This makes them popular in regulated industries (credit, healthcare) where explainability is required. Their weakness is that deep trees overfit easily.
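The example tree above translates directly into nested conditionals. A minimal sketch — the feature names and thresholds come from the diagram, not from any real trained model:

```python
def churn_risk(days_since_login: int, features_adopted: int, plan: str) -> str:
    """Mirror of the example churn tree: each branch is one yes/no question."""
    if days_since_login > 14:
        if features_adopted < 3:
            return "HIGH"
        return "MEDIUM"
    if plan == "Free":
        return "MEDIUM"
    return "LOW"

# Tracing a prediction is just reading off the branch taken:
churn_risk(days_since_login=21, features_adopted=1, plan="Pro")   # "HIGH"
churn_risk(days_since_login=3, features_adopted=5, plan="Free")   # "MEDIUM"
```

This is the whole appeal of a single tree: the model *is* its explanation.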
In practice, teams often use random forests (many trees averaged) or gradient boosting (trees built sequentially to fix each other's errors). These are less interpretable but significantly more accurate.
PM Insight
Ask your team: does this use case require explainability? If a user is denied a loan or flagged for review, they may be legally entitled to a reason. A random forest can't give you one. A single decision tree can. Accuracy vs interpretability is a product decision, not just a technical one.
Common modeling methods
Decision trees are a useful mental model, but in practice your team will reach for a small set of standard algorithms depending on the problem. Here's what each one is, when it's used, and what you as a PM need to know about the tradeoffs.
| Model | Problem type | Interpretable? | Typical accuracy | When to push for it |
|---|---|---|---|---|
| Logistic regression | Classification | Yes — coefficients | Modest baseline | Regulated use cases, need to explain decisions, quick baseline |
| Linear regression | Regression | Yes — coefficients | Modest baseline | Revenue or demand forecasting, interpretable trends |
| Random forest | Both | Partially (feature importance) | Good | Better accuracy than a single tree; robust, few hyperparameters to tune |
| Gradient boosting (XGBoost, LightGBM) | Both | No | Excellent on tabular data | Production accuracy ceiling on structured data; the default choice in industry |
| Neural networks | Both | No (black box) | State of the art on images, text, audio | Unstructured data — images, language, time series. Covered in Ch 10. |
Logistic regression
Despite the name, logistic regression is a classification model. It calculates a weighted sum of the input features, then squashes the result into a probability between 0 and 1. Each feature gets a coefficient showing its direction and weight: "every extra day since last login increases churn probability by X." This makes it fully auditable and the right choice when you need to explain individual decisions to regulators or users.
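The whole mechanism fits in a few lines. A sketch with hypothetical coefficients (the intercept and weights below are invented for illustration, not learned from data):

```python
import math

def churn_probability(days_since_login: float, features_adopted: float) -> float:
    # Hypothetical learned coefficients -- illustrative only.
    intercept = -2.0
    w_days = 0.15        # more days inactive  -> higher churn risk
    w_features = -0.4    # more features adopted -> lower churn risk
    score = intercept + w_days * days_since_login + w_features * features_adopted
    return 1 / (1 + math.exp(-score))   # sigmoid squashes the sum into (0, 1)

churn_probability(days_since_login=20, features_adopted=1)   # ~0.65
```

Because each coefficient is visible, the model's reasoning is fully auditable: the sign tells you the direction of the effect, the magnitude tells you its weight.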
Linear regression
The regression counterpart — outputs a continuous number instead of a probability. Fits a line (or hyperplane) through the training data; each feature has a coefficient. "Each additional $1 in average order value predicts $0.14 more revenue per user per month." Interpretable, fast, and a useful baseline before reaching for more complex models.
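For a single feature, the ordinary-least-squares fit has a closed form. A sketch with made-up data chosen to echo the $0.14 figure above:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical data: average order value ($) -> monthly revenue per user ($)
aov = [10, 20, 30, 40]
rev = [2.0, 3.4, 4.8, 6.2]
slope, intercept = fit_line(aov, rev)
# slope = 0.14: each extra $1 of AOV predicts $0.14 more monthly revenue
```

The learned slope is the coefficient you'd read out of the model — which is exactly why linear regression is the interpretable baseline for forecasting.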
Random forests
A random forest builds hundreds of decision trees — each on a random subset of the data and a random subset of features — then averages their predictions. The individual trees make different errors that cancel out in aggregate, which dramatically reduces overfitting compared to a single tree. You lose the "trace the path" interpretability, but you gain feature importance scores: which features the forest relied on most overall. Robust and hard to break with bad settings.
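The averaging step itself is trivial. A toy sketch in which three hypothetical "trees" (here just hand-written rules standing in for trees grown on different bootstrap samples) vote, and the forest's score is their mean:

```python
def forest_predict(trees, x):
    """A random forest's score is the average of its trees' votes."""
    votes = [tree(x) for tree in trees]
    return sum(votes) / len(votes)

# Three hypothetical trees -- each picked up slightly different rules
# because each saw a different random slice of the data and features.
trees = [
    lambda x: 1 if x["days_inactive"] > 10 else 0,
    lambda x: 1 if x["days_inactive"] > 14 or x["tickets"] > 2 else 0,
    lambda x: 1 if x["features_adopted"] < 3 else 0,
]

user = {"days_inactive": 12, "tickets": 0, "features_adopted": 5}
forest_predict(trees, user)   # 1/3 of the trees vote churn -> score ~0.33
```

No single tree's quirk dominates: an individual tree's over-eager rule gets outvoted, which is the variance-reduction intuition in miniature.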
Gradient boosting (XGBoost, LightGBM)
Where random forests build trees in parallel and average them, gradient boosting builds trees sequentially — each new tree focuses on correcting the errors of the previous one. The result is the highest-accuracy option for tabular data in industry. XGBoost and LightGBM are the two dominant libraries; LightGBM is faster and preferred for large datasets. The tradeoff: more hyperparameters to tune, and no clean per-prediction explanation without tools like SHAP.
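The sequential-correction idea can be shown without real trees. In this sketch the "weak learner" is reduced to a constant (the mean of the current residuals) — a deliberate simplification standing in for the shallow trees XGBoost or LightGBM would actually fit:

```python
def boost(ys, n_rounds=5, learning_rate=0.5):
    """Gradient boosting for squared error, with a constant weak learner."""
    predictions = [0.0] * len(ys)
    losses = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        correction = sum(residuals) / len(residuals)       # weak learner's fit
        predictions = [p + learning_rate * correction for p in predictions]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return predictions, losses

preds, losses = boost([3.0, 5.0, 7.0])
# losses falls every round: each learner corrects what the previous ones left.
```

With real trees as the weak learners, each round can correct different errors for different rows — which is where the accuracy advantage on tabular data comes from.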
PM Insight
If your team is spending weeks tuning a neural network on tabular data (user features, transaction records, product attributes), ask whether gradient boosting has been tried. It typically outperforms neural networks on structured data and trains in minutes, not hours. The rough rule: logistic regression as the interpretable baseline, gradient boosting as the accuracy ceiling, random forest as a robust middle ground.
Overfitting: when more complexity hurts
Every model has a complexity dial — tree depth, number of trees, number of layers. At low complexity, the model is too simple to capture real patterns in the data (underfitting). At high complexity, it memorises the training data so thoroughly that it fails on new data it's never seen (overfitting). The goal is the sweet spot in between.
The hallmark of overfitting: training error keeps falling as complexity increases, but test error stops improving and starts rising. A model that's 99% accurate on training data and 67% accurate in production is not a good model — it's a model that memorised the exam rather than learning the material.
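The limiting case of overfitting is a model that is literally a lookup table. A toy illustration of why perfect training accuracy proves nothing (data invented for the example):

```python
# Toy data: (days_inactive, churned). The real pattern: inactivity -> churn.
train = [(2, 0), (5, 0), (20, 1), (30, 1), (7, 1)]   # (7, 1) is a noisy row
test  = [(3, 0), (6, 0), (25, 1), (8, 0)]

# "Overfit" model: memorise training answers, guess 0 for anything unseen.
memorised = dict(train)
overfit = lambda days: memorised.get(days, 0)

# Simple model: one threshold -- it learned the pattern, not the rows.
simple = lambda days: 1 if days > 10 else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

accuracy(overfit, train)   # 1.0 -- perfect on what it memorised
accuracy(overfit, test)    # worse than the simple model on unseen users
```

The memoriser even "learned" the noisy row perfectly — which is precisely the behaviour that falls apart in production.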
Interactive — Overfitting Visualizer
Green line = training error | Red dashed = test error | Drag the slider to increase model complexity.
Why accuracy is the wrong metric
Imagine you're building a fraud detection model. 99% of transactions are legitimate. A model that predicts "not fraud" for every single transaction achieves 99% accuracy — and catches zero fraud. Congratulations on a useless model.
This is the class imbalance problem, and it's extremely common in product ML: churn rates of 3–5%, fraud rates under 1%, click-through rates of 0.1%. When the thing you care about is rare, accuracy is meaningless.
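The fraud example takes four lines to verify:

```python
# 1,000 transactions, 10 fraudulent (1% fraud rate)
labels = [1] * 10 + [0] * 990

# A "model" that always predicts not-fraud:
predictions = [0] * 1000

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)  # 0.99
recall = sum(p == 1 and y == 1
             for p, y in zip(predictions, labels)) / sum(labels)           # 0.0
```

99% accurate, zero fraud caught. Any metric worth reporting on imbalanced data has to look at the rare class specifically.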
Instead, you need metrics that measure the specific types of errors that matter for your use case.
The four outcomes of any classification
For every prediction a classifier makes, one of four things happens. The model predicts churn (positive) or no-churn (negative), and the user either churns or doesn't:
- True positive — predicted churn, user churned. A correct flag.
- False positive — predicted churn, user stayed. A wasted intervention.
- False negative — predicted stay, user churned. A missed churner.
- True negative — predicted stay, user stayed. Correctly left alone.
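Counting the four outcomes gives you the confusion matrix. A minimal sketch on invented labels:

```python
def confusion(y_true, y_pred):
    """Count the four outcomes of a binary classifier: (TP, FP, FN, TN)."""
    tp = sum(p == 1 and y == 1 for y, p in zip(y_true, y_pred))
    fp = sum(p == 1 and y == 0 for y, p in zip(y_true, y_pred))
    fn = sum(p == 0 and y == 1 for y, p in zip(y_true, y_pred))
    tn = sum(p == 0 and y == 0 for y, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]   # 1 = user actually churned
y_pred = [1, 0, 1, 0, 1, 0]   # model's calls
confusion(y_true, y_pred)     # (2, 1, 1, 2)
```

Every classification metric in this chapter — precision, recall, F1, the points on a ROC curve — is computed from these four counts.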
Precision, recall, and the threshold tradeoff
Most classifiers don't output a hard yes/no — they output a probability score. "This user has a 0.73 probability of churning." A threshold converts that score into a decision: above the threshold → predict churn; below → predict stay.
Where you set that threshold is a product decision. Lower it and you flag more users as churn risk — catching more real churners (higher recall) but also wasting intervention budget on users who would have stayed (lower precision). Raise it and you're selective but miss real churners.
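The tradeoff is easy to see numerically. A sketch on a hypothetical set of model scores (invented for illustration):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of 'flag everyone at or above the threshold'."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    flagged, churners = sum(preds), sum(labels)
    return (tp / flagged if flagged else 0.0), tp / churners

# Hypothetical churn scores for eight users (1 = actually churned)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   1,   0,   0  ]

precision_recall(scores, labels, threshold=0.75)  # strict: precision 1.0, recall 0.5
precision_recall(scores, labels, threshold=0.35)  # loose:  precision ~0.67, recall 1.0
```

Same model, same scores — moving the threshold alone trades precision for recall.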
The visualizer below shows a synthetic churn model. Drag the threshold and watch how the confusion matrix and all four metrics shift.
Interactive — Classification Threshold Explorer
Red bars = churners | Blue bars = users who stayed | Orange line = your threshold
Precision and recall: a plain-language guide
Precision — "When we flag someone, how often are we right?"
Of all users predicted to churn, what fraction actually churned? Low precision = lots of wasted interventions. If your team calls 100 "at-risk" users and 80 were never going to churn, you've wasted 80 calls and possibly annoyed loyal customers.
Recall — "Of all the churners, how many did we catch?"
Of all users who actually churned, what fraction did the model flag? Low recall = churners slipping through. If 60% of churners are never flagged, your intervention program is only reaching 40% of the people who need it.
F1 Score — "The harmonic mean of precision and recall"
The harmonic mean penalises whichever of precision or recall is lower — a model with precision 0.9 and recall 0.1 has an F1 of 0.18, not 0.5. This makes F1 more honest than a simple average when one dimension is weak. When one error type matters significantly more than the other, use a weighted variant instead: F2 emphasises recall; F0.5 emphasises precision.
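The 0.18-vs-0.5 contrast from the paragraph above, computed directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1(0.9, 0.1)       # ~0.18 -- the weak dimension drags the score down
(0.9 + 0.1) / 2    # 0.5  -- the simple average would hide the weakness
```

A model can only score a high F1 by being decent on both dimensions at once.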
Precision vs recall: which matters more?
It depends on the cost of each error type — which is a business decision, not a statistics one.
- Optimize for precision when false positives are expensive or harmful. Wrongly flagging someone for fraud means a blocked card, a frustrated customer, a support call. Better to miss some fraud than to alienate good customers.
- Optimize for recall when false negatives are expensive. Missed cancer diagnoses, missed security breaches, missed churn before a contract renewal — here, catching everything matters more than occasional false alarms.
PM Insight
Your team shouldn't decide the precision/recall tradeoff — you should. It's a business decision about the relative cost of different errors. "What's worse: flagging a happy user as at-risk, or missing a churner?" The answer sets where you put the threshold.
AUC: evaluating the model itself, not the threshold
Precision and recall depend on your threshold choice. But how do you evaluate whether the model's underlying scores are any good — before deciding on a threshold?
AUC stands for Area Under the ROC Curve. The ROC curve (Receiver Operating Characteristic — a name inherited from radar engineering) plots the model's true positive rate against its false positive rate at every possible threshold from 0 to 1. AUC is simply the area under that curve — a single number that captures how well the model separates the two classes across all thresholds. An AUC of 1.0 is perfect; 0.5 means the model is no better than random guessing.
- AUC > 0.9 — excellent discrimination. The model separates classes very well across all thresholds.
- AUC 0.75–0.9 — good. Useful in most product contexts.
- AUC 0.6–0.75 — modest. Better than random, but threshold matters a lot.
- AUC < 0.6 — weak signal. Be very cautious about acting on model outputs.
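AUC has an equivalent, very concrete reading: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count half). A sketch using the same kind of synthetic scores as the threshold example:

```python
def auc(scores, labels):
    """AUC as a ranking probability: P(random positive outscores random
    negative), with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   1,   0,   0  ]
auc(scores, labels)   # ~0.81 -- good separation, not perfect
```

Note that no threshold appears anywhere in the computation — which is exactly the point: AUC evaluates the scores, not the decision rule.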
PM Insight
AUC tells you whether the model can rank users correctly — higher-risk users getting higher scores than lower-risk ones. It doesn't tell you whether the scores are calibrated (i.e. whether a score of 0.7 actually means 70% probability). For ranking use cases (recommender systems, search), AUC is the right headline. For intervention use cases (churn, fraud), also check calibration.
Interactive — ROC Curve
The full ROC curve for the same churn model. Each point on the curve is one threshold setting. Drag the slider to move the orange dot and see how TPR and FPR shift. The shaded region is the AUC.
Explainability: why did the model predict that?
A model predicts that a user has an 84% probability of churning. Your customer success team asks: why? A user is denied a loan. Regulations say they're entitled to a reason. A content moderation decision is appealed. Support needs to explain it.
The answer depends entirely on which type of model you're using — and whether you've planned for explainability from the start.
Interpretable by design
Some models are transparent: you can read their logic directly.
- Logistic regression — each feature has a coefficient that shows its direction and weight. "Every additional day since last login increases churn probability by X." Fully auditable.
- Decision trees — predictions follow a traceable path of rules (as in the example above). You can show a user the exact sequence of conditions that led to their outcome.
The tradeoff: these models are often less accurate than more complex alternatives. In many use cases, that's an acceptable price. In regulated contexts, it may be the only acceptable option.
Post-hoc explanation for black-box models
When you need accuracy and explainability, post-hoc explanation methods generate a human-readable explanation of a prediction after the model has made it — without changing the model itself.
Feature importance vs per-prediction explanation
Feature importance is a global view: across the whole dataset, which features did the model rely on most? Useful for understanding the model's general behaviour, catching bias, and checking that the model is using sensible signals. It does not tell you why any single prediction was made.
SHAP (SHapley Additive exPlanations) gives a per-prediction breakdown: for this specific user, how much did each feature push the score up or down from the baseline?
User #4821 — churn probability: 0.82
↑ +0.31 — no login in 18 days
↑ +0.22 — zero mobile app sessions this month
↑ +0.18 — support ticket opened this week
↓ −0.09 — on annual plan (stabilising factor)
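The "Additive" in SHAP's name is literal: the baseline score plus the per-feature contributions equals the prediction. A check using the hypothetical user #4821 breakdown above — the 0.20 baseline (the model's average score) is an assumed value for illustration:

```python
# SHAP additivity: baseline + per-feature contributions = the final score.
baseline = 0.20   # assumed average model output, for illustration only
contributions = {
    "no login in 18 days":       +0.31,
    "zero mobile app sessions":  +0.22,
    "support ticket this week":  +0.18,
    "on annual plan":            -0.09,
}
score = baseline + sum(contributions.values())
round(score, 2)   # 0.82 -- matches the predicted churn probability
```

That additivity is what makes the breakdown trustworthy as an accounting of the score, even though the attribution itself is an approximation.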
LIME (Local Interpretable Model-agnostic Explanations) works similarly — it fits a simple interpretable model around a single prediction to approximate the complex model's local behaviour.
The honest limitation
Post-hoc explanations are approximations. SHAP and LIME describe what a model appears to be doing for a given prediction — they don't give you a mathematical proof of causality. For complex models with thousands of interacting features, the explanation is a useful approximation, not a complete account. Good enough for most product decisions; not sufficient for high-stakes legal contexts without additional validation.
When explainability is a legal requirement
In some contexts, the ability to explain a decision isn't optional:
- Consumer credit and lending — regulations in the US (ECOA) and EU require that applicants denied credit receive a specific reason. "The model said so" is not compliant.
- GDPR Article 22 — individuals subject to fully automated decisions with significant effects have the right to an explanation and the right to human review.
- EU AI Act — high-risk AI systems (hiring, education, credit, law enforcement) require transparency and documentation of model logic.
- Hiring and HR — using ML in recruitment decisions is heavily scrutinised in most jurisdictions; unexplainable rejection reasons create legal exposure.
PM Insight
Explainability requirements should be established at kickoff — not retrofitted after a model is built. "Does this use case involve consequential decisions about individuals?" If the answer is yes, constrain the model choice up front, or budget for SHAP integration and human review workflows before you commit to a launch date. Discovering you need explainability after you've shipped a neural network is expensive.
PM Playbook — Questions to ask
- Is this classification or regression? — sets the right evaluation framework
- What's the class balance? — if the target is rare, accuracy is meaningless
- What's the cost of a false positive vs a false negative? — you define this, not the data team
- What threshold are we using and why? — it should be driven by business costs, not default 0.5
- Show me the confusion matrix, not just accuracy — always
- What's the AUC on the held-out test set? — and is that test set truly representative?
- Does this use case involve consequential decisions about individuals? — if yes, establish explainability requirements at kickoff: constrain model choice, plan for SHAP integration, and define the human review workflow
- Are there regulatory requirements for explanation? — credit, hiring, healthcare, and GDPR-covered automated decisions all have specific obligations