Chapter 06
How ML Actually Works
Training, features, labels, overfitting — what a machine learning model really does, without the equations.
ML is pattern-finding, not rule-writing
Traditional software is explicit. A developer writes rules: if the user has been inactive for 30 days, send a re-engagement email. The logic is readable, auditable, and static. It does exactly what you tell it.
Machine learning flips this. Instead of writing rules, you show the system examples — thousands or millions of them — and let it figure out the rules itself. You don't tell the model "inactive for 30 days = churn risk." You show it 500,000 historical users, their behavior, and whether they churned, and it learns which patterns predict churn.
This is powerful when the rules are too complex to write — like recognizing spam, ranking search results, or detecting fraud. It's a liability when you can't explain what rules the model learned, or when the patterns in historical data don't reflect the world you're shipping into.
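The contrast is easy to sketch in code. Below, a hand-written rule sits next to a toy "learning" routine that recovers a threshold from labeled examples by brute-force search. Every number and name here is invented for illustration:

```python
# Toy contrast: a hand-written rule vs. a threshold "learned" from examples.

def hand_written_rule(days_inactive):
    # The developer chose 30 by judgment; it never changes on its own.
    return days_inactive >= 30

# Historical examples: (days_inactive, churned?)
history = [(2, False), (5, False), (11, False), (19, False),
           (24, True), (31, True), (40, True), (55, True)]

def learn_threshold(examples):
    # Try every candidate threshold and keep the one that best
    # separates churners from non-churners in the data.
    best_t, best_correct = 0, -1
    for t in range(0, 61):
        correct = sum((d >= t) == churned for d, churned in examples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

learned_t = learn_threshold(history)
# -> 20 for this data: the first value that cleanly splits the examples
```

Real models learn far richer patterns than one threshold, but the shape is the same: the rule comes out of the data, not out of a developer's head.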
PM Insight
"Just use ML" is often said in product meetings as if it's a straightforward upgrade from rules. It's not. ML trades interpretability and control for pattern-matching power. Knowing when that trade is worth it — and when it isn't — is one of the most valuable things a PM can bring to a technical conversation.
The ingredients: features and labels
Every ML model learns from two things:
Features — the inputs
Features are the pieces of information you give the model to make a prediction. For a churn model: days since last login, number of sessions last month, plan tier, number of errors encountered, feature adoption rate. Each of these is a feature.
Feature selection is where a huge amount of the actual ML work happens — and where domain knowledge (yours) is irreplaceable. The model can only learn from patterns in the features you give it. If you don't include a feature, the model can't use it. If you include a feature that leaks future information into the past, the model will appear to work brilliantly — and fail in production.
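Concretely, features are just the numbers you hand the model. A sketch, with every field name invented, of turning one raw user record into a feature vector for a hypothetical churn model:

```python
from datetime import date

# Hypothetical raw user record -- in a real product this would come
# from your analytics store; every field name here is invented.
user = {
    "last_login": date(2024, 5, 1),
    "sessions_last_month": 4,
    "plan_tier": "pro",          # categorical -> must be encoded as a number
    "errors_last_month": 7,
    "features_adopted": 3,
    "features_available": 12,
}

def to_feature_vector(user, today):
    # The model sees ONLY these numbers. Anything you don't compute
    # here simply doesn't exist as far as the model is concerned.
    return [
        (today - user["last_login"]).days,             # days since last login
        user["sessions_last_month"],
        1.0 if user["plan_tier"] == "pro" else 0.0,    # crude one-hot encoding
        user["errors_last_month"],
        user["features_adopted"] / user["features_available"],  # adoption rate
    ]

vec = to_feature_vector(user, today=date(2024, 5, 31))
# -> [30, 4, 1.0, 7, 0.25]
```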
Example: temporal leakage
You're building a churn model. Someone adds a feature: "number of support tickets raised in the 30 days before the churn date." In training data, this is available — the month is already over. The model discovers it's highly predictive and scores 96% accuracy.
In production, you're predicting whether a user will churn over the next 30 days. Those future tickets don't exist yet. The feature that made the model look brilliant simply can't be computed — and the model fails silently.
The model didn't learn "this user seems frustrated." It learned "this user already churned." It was handed the answer during training, disguised as an input.
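The whole failure can be simulated in a few lines. Everything below is synthetic: by construction, churned users raised tickets in the window ending at their churn date, so the "model" looks perfect on historical data and collapses when the serving pipeline can only supply zero:

```python
import random

random.seed(0)

# Synthetic historical users. By construction, users who churned raised
# tickets in their final month -- the leaky feature encodes the label.
history = []
for _ in range(1000):
    churned = random.random() < 0.3
    tickets_before_churn_date = random.randint(3, 6) if churned else random.randint(0, 2)
    history.append((tickets_before_churn_date, churned))

def predict(tickets):
    # "Model": churn if 3+ tickets in the window. On historical data
    # this looks brilliant, because the window ends at the churn date.
    return tickets >= 3

historical_acc = sum(predict(t) == y for t, y in history) / len(history)
# -> 1.0: perfect, because the feature IS the answer

# In production the window is the NEXT 30 days -- those tickets don't
# exist yet, so the feature pipeline can only supply 0.
production_acc = sum(predict(0) == y for _, y in history) / len(history)
# -> roughly the base retention rate: the model predicts nobody churns
```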
Labels — what you're predicting
Labels are the answers you want the model to learn. Did this user churn (yes/no)? What did this user click on? How much revenue will this account generate? Labels are typically drawn from historical data — you know what actually happened.
The label definition problem
How you define the label shapes everything the model learns. "Churn" defined as "no login in 30 days" vs "subscription cancelled" trains completely different models. A mismatch between your label definition and the business problem you're solving is one of the most common — and most invisible — ML failure modes.
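A toy sketch makes the disagreement concrete (fields invented): a user who went quiet but never cancelled is a churner under one definition and a loyal customer under the other.

```python
from datetime import date

# One user's history -- invented for illustration.
user = {"last_login": date(2024, 3, 1), "cancelled_subscription": False}

TODAY = date(2024, 5, 1)

def label_by_inactivity(user):
    # "Churn" = no login in the last 30 days.
    return (TODAY - user["last_login"]).days > 30

def label_by_cancellation(user):
    # "Churn" = subscription actually cancelled.
    return user["cancelled_subscription"]

print(label_by_inactivity(user))    # True: 61 days since last login
print(label_by_cancellation(user))  # False: still paying
```

Same user, opposite labels — and a model trained on each would learn to predict a genuinely different thing.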
How training works
Training is the process of finding the model parameters that best explain the patterns in your data. Here's the intuition:
Start with a guess
The model begins with random (or initialized) parameters — essentially no knowledge.
Make predictions on training data
Feed examples through the model and see what it predicts. At first, predictions are terrible.
Measure the error
Compare predictions to the true labels. The difference is the loss — how wrong the model is.
Adjust parameters to reduce error
Using an algorithm called gradient descent — think of it like finding the lowest point in a hilly landscape by always taking a small step downhill — the model nudges its parameters in the direction that reduces the error.
Repeat — many times
Thousands or millions of iterations later, the model has found parameters that fit the training data well.
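The five steps above fit in a short script. A minimal sketch: one feature, noiseless data (y = 3x + 1), plain gradient descent on mean squared error.

```python
# The loop above, end to end, for a one-feature model y = w*x + b.
# Data is noiseless (y = 3x + 1) so the fitted parameters land exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 1 for x in xs]

w, b = 0.0, 0.0          # step 1: start with a guess
lr = 0.05                # learning rate: size of each downhill step

for _ in range(5000):
    preds = [w * x + b for x in xs]                  # step 2: predict
    errs = [p - y for p, y in zip(preds, ys)]        # step 3: measure error
    # step 4: gradient of mean squared error, nudge parameters downhill
    n = len(xs)
    dw = sum(e * x for e, x in zip(errs, xs)) * 2 / n
    db = sum(errs) * 2 / n
    w -= lr * dw
    b -= lr * db
    # step 5: repeat

# w -> 3.0, b -> 1.0: the model recovered the true relationship
```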
"Fitting the training data well" sounds like the goal. It isn't. The goal is fitting patterns that also hold true on new data the model has never seen. This is where overfitting enters the picture.
The train/test split: why you need held-out data
Imagine studying for an exam by memorizing the answer key. You'll score 100% on those exact questions — and fail completely on anything slightly different. A model that memorizes training data has the same problem.
To catch this, data scientists split their data before training begins:
- Training set — what the model learns from (typically 70–80% of data)
- Validation set — used to tune the model during development
- Test set — held out entirely, used once to evaluate final performance
If a model performs well on training data but poorly on the test set, it has overfit — it memorized the training data instead of learning generalizable patterns.
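A memorizing "model" makes the exam analogy literal. In this synthetic sketch, a lookup table scores perfectly on the 80% of data it trained on and falls back to a default answer on the held-out 20%:

```python
import random

random.seed(1)

# Synthetic data: one continuous feature, label = feature > 0.5.
# Floats are effectively unique, so memorization works on train
# and is useless on test.
data = [random.random() for _ in range(200)]
labeled = [(x, x > 0.5) for x in data]
train, test = labeled[:160], labeled[160:]   # 80/20 split

# "Model": pure memorization -- a lookup table with a default answer.
memory = dict(train)
def memorizer(x):
    return memory.get(x, False)

train_acc = sum(memorizer(x) == y for x, y in train) / len(train)
test_acc = sum(memorizer(x) == y for x, y in test) / len(test)
# train_acc == 1.0, test_acc collapses toward the base rate
```

The gap between those two numbers is exactly what the test set exists to surface.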
PM Insight
Always ask: what's the performance on the held-out test set, not just the training set? If your team only reports training accuracy, they either don't have a test set (red flag) or the test set wasn't truly held out (also a red flag).
Overfitting: the central tension in ML
Overfitting happens when a model is too complex — it learns the noise in the training data, not just the underlying signal. It memorizes quirks that are specific to the training set and won't appear in production.
The opposite — underfitting — happens when the model is too simple to capture real patterns. Both extremes hurt performance.
The visualization below makes this concrete. Drag the slider to increase model complexity (polynomial degree). Watch how training error and test error diverge as complexity grows past the sweet spot.
Interactive — Overfitting Visualizer
Drag the slider to change model complexity. Blue dots are training data. Grey dots are test data. The green dashed line is the true underlying pattern.
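The same divergence can be reproduced without a chart. This sketch swaps in a different complexity knob — nearest-neighbor regression, where k = 1 is maximal complexity: it memorizes the training points exactly (zero training error) while its test error stays stuck at the noise level.

```python
import math
import random

random.seed(2)

def sample(n):
    # Noisy observations of a smooth underlying pattern (sin).
    pts = []
    for _ in range(n):
        x = random.uniform(0, 6)
        pts.append((x, math.sin(x) + random.gauss(0, 0.3)))
    return pts

train, test = sample(40), sample(40)

def knn_predict(x, k):
    # k-nearest-neighbor regression: average the k closest training ys.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(points, k):
    return sum((knn_predict(x, k) - y) ** 2 for x, y in points) / len(points)

# k=1 is maximally complex: every training point is its own nearest
# neighbor, so training error is exactly zero -- pure memorization.
train_err_k1 = mse(train, 1)
test_err_k1 = mse(test, 1)    # stuck well above zero: the overfitting gap
```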
What keeps overfitting in check
Data scientists use several techniques to prevent overfitting:
- More training data — harder to memorize noise when there's more signal
- Regularization — a penalty that discourages overly complex models
- Cross-validation — evaluating on multiple held-out subsets, not just one
- Early stopping — halting training before the model starts memorizing
- Simpler model architecture — sometimes the right call is a less powerful model
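Regularization is the easiest of these to watch in action. In this sketch, adding an L2 penalty (lam * w**2) to a plain mean-squared-error fit pulls the learned slope away from its unpenalized value and toward zero:

```python
# Ridge-style L2 regularization on a one-feature linear model.
# The penalty lam * w**2 adds 2 * lam * w to the gradient, which
# constantly pulls w back toward zero as training proceeds.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 1 for x in xs]   # true relationship: y = 3x + 1

def fit(lam, steps=5000, lr=0.05):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        dw = sum(e * x for e, x in zip(errs, xs)) * 2 / n + 2 * lam * w
        db = sum(errs) * 2 / n
        w -= lr * dw
        b -= lr * db
    return w, b

w_plain, _ = fit(lam=0.0)   # -> ~3.0: fits the data exactly
w_reg, _ = fit(lam=1.0)     # noticeably smaller: fit traded for simplicity
```

On noiseless data the penalty only hurts; on noisy data, that same pull toward simpler models is what stops the fit from chasing noise.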
The data quality problem nobody talks about enough
A model trained on biased, incomplete, or stale data will learn biased, incomplete, or stale patterns. It has no way to know its training data was bad — it just learns whatever patterns exist in what it was given.
Common data problems that silently poison models:
- Selection bias — training data isn't representative of the population you're predicting on. A fraud model trained only on detected fraud misses undetected patterns by definition.
- Label noise — the labels themselves are wrong. Human annotators disagree, automated labels have errors, and definitions shift over time.
- Feature leakage — a feature that accidentally encodes the answer. A model that uses "refund requested" to predict "user is unhappy" is using a consequence as a cause.
- Distribution shift — the world changes, but the training data doesn't. A pricing model trained before a major competitor launched a free tier will be confidently wrong about what users are willing to pay.
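The pricing example can be simulated directly. Everything below is synthetic and seeded: a price threshold is fitted on last year's users, then evaluated after "a competitor launches a free tier" collapses willingness to pay.

```python
import random

random.seed(3)

def users(n, max_tolerance):
    # Each user converts if the quoted price is under their private
    # willingness to pay, drawn uniformly up to max_tolerance.
    out = []
    for _ in range(n):
        tolerance = random.uniform(0, max_tolerance)
        price = random.uniform(0, 30)
        out.append((price, price < tolerance))
    return out

def accuracy(data, threshold):
    # "Model": predict conversion whenever price < threshold.
    return sum((price < threshold) == converted
               for price, converted in data) / len(data)

# Fit the threshold on last year's world (tolerances up to $30)...
old_world = users(10000, max_tolerance=30)
model_t = max(range(31), key=lambda t: accuracy(old_world, t))

# ...then a competitor launches a free tier and tolerances collapse.
new_world = users(10000, max_tolerance=5)
acc_then = accuracy(old_world, model_t)
acc_now = accuracy(new_world, model_t)   # same model, stale world
```

Nothing about the model changed between the two numbers; only the world did.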
PM Insight
Before asking "is the model good?", ask "is the training data trustworthy?" Garbage in, garbage out — but the garbage is invisible in the model metrics. A model that looks great on held-out test data from the same flawed dataset will still fail in production.
From training to production
Training a model is not the same as deploying one. Once a model passes evaluation, several things need to happen before it reaches users:
- Serving infrastructure — the model needs to make predictions at the latency your product requires. Milliseconds for real-time ranking; minutes for batch offline scoring.
- Feature pipelines — the same features used in training need to be available at inference time, in the same format. Mismatches here cause silent failures.
- Monitoring — model performance needs to be tracked in production. Models degrade as the world changes. Without monitoring, you won't know when it happens.
- Retraining — a plan for how and when to retrain on fresh data.
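The feature-pipeline point deserves one concrete sketch: a model trained on a normalized feature, then served the raw value. Nothing crashes and no error is logged; the predictions are simply wrong. All names and numbers here are invented:

```python
# Training pipeline normalized the feature to [0, 1]; the model's
# learned rule is expressed in that normalized space.
MAX_SESSIONS = 50.0

def training_features(raw_sessions):
    return raw_sessions / MAX_SESSIONS        # e.g. 10 sessions -> 0.2

def model(feature):
    # Learned rule (in normalized units): "engaged" if feature > 0.3.
    return feature > 0.3

# Correct serving path: same transform as training.
assert model(training_features(10)) is False   # 0.2 -> not engaged
assert model(training_features(25)) is True    # 0.5 -> engaged

# Buggy serving path: someone wires the RAW count into the model.
# No exception is raised -- every user just looks "engaged".
assert model(10) is True    # 10 > 0.3: silently wrong
```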
PM Insight
When your data scientist says "the model is done," ask what "done" means. A trained model sitting in a notebook is very different from a model in production with monitoring, a retraining pipeline, and a rollback plan. Scope all of that into the project before committing to a launch date.
PM Playbook — Questions to ask
- What problem are we replacing rules with ML for — and is that trade worth it?
- What are the features, and who defined them? — domain expertise in feature selection is often the highest-leverage work
- How is the label defined? — and does it map to the business outcome we actually care about?
- What's the test set performance? — not training accuracy, and ideally on data from a different time period
- Is there any risk of feature leakage? — does any feature accidentally encode the answer?
- How representative is the training data? — of our current users, current product, current world?
- What happens when the model is wrong? — what does the user experience, and can we recover?
- What's the monitoring and retraining plan? — before launch, not after