Chapter 08
Unsupervised Learning
Clustering, user segmentation, and how ML finds structure in data with no labels to learn from.
Learning without answers
Supervised learning needs labeled data — examples with known answers. But what if you don't have labels? What if you just have a large pile of user behaviour data and want to know: are there natural groups in here?
That's unsupervised learning. No labels, no right answers — just algorithms that find structure in raw data. The two most useful flavours for product work are clustering (grouping similar things) and dimensionality reduction (compressing many variables into fewer, meaningful ones).
To make this concrete, we'll follow one scenario throughout the chapter. Imagine you're a PM at a B2B SaaS company. You have 4,000 customers and a dashboard full of behavioural signals: login frequency, features used, team size, support tickets opened, time-to-value. You suspect your customers fall into distinct archetypes — but nobody has labeled them. Unsupervised learning is how you find out.
PM Insight
Unsupervised learning is where a lot of exploratory product analytics lives. "Who are our users?" comes before "what do our users want?" — and it's an unsupervised question. You're discovering structure, not predicting outcomes. The answers shape roadmap priorities, messaging, and which users you talk to next.
Clustering: finding natural groups
Clustering algorithms group data points so that things within a group are more similar to each other than to things in other groups. The algorithm defines "similar" mathematically — usually by distance in feature space.
K-means is the most common clustering algorithm. It works like this:
1. You choose k — the number of clusters you want
2. The algorithm places k centroids (cluster centres) randomly
3. Each data point is assigned to its nearest centroid
4. Each centroid moves to the average position of its assigned points
5. Steps 3–4 repeat until assignments stop changing
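If you want to see the mechanics, the whole loop fits in a few lines of NumPy. This is a toy sketch on invented points; in practice your team would use a library implementation such as scikit-learn's KMeans, which adds smarter initialisation and multiple restarts:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: place k centroids at randomly chosen data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        moved = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(moved, centroids):  # Step 5: stop once nothing changes
            break
        centroids = moved
    return labels, centroids

# Two obvious groups of points; the loop should separate them cleanly
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(pts, k=2)
```

Note the seed parameter: different starting centroids can end in different clusterings, which is exactly the initialisation sensitivity discussed in this chapter.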
Interactive — K-Means Clustering
Each ✦ is a cluster centroid. Coloured dots are data points assigned to the nearest centroid. Step through iterations manually or hit Run to animate convergence.
What k-means can't do
K-means has real limitations that matter for product decisions:
- You must choose k. The algorithm doesn't tell you how many clusters exist — that's a judgment call. Techniques like the "elbow method" help, but choosing k is ultimately a business question: how many segments are actionable?
- Clusters are spherical. K-means assumes clusters are roughly round blobs. It struggles with elongated, irregular, or nested clusters.
- Sensitive to outliers. Extreme users can pull centroids away from the true cluster centre. Pre-processing to handle outliers is usually necessary.
- Results vary with initialisation. Different starting centroids can produce different final clusters. Always run multiple times and pick the best result.
PM Insight
When your data team says "we ran clustering and found 5 user segments," ask: Why 5? What happens at 3 or 8? The number of clusters is a choice, not a discovery. Make sure it's driven by what's actionable, not what's mathematically convenient.
Picking k: the elbow method
One way to inform the choice of k is the elbow method. Run k-means for k = 1, 2, 3 … and measure the total within-cluster variance at each step — how spread out points are inside their assigned cluster. As k increases, variance always falls (more clusters = tighter fits). But the improvement gets smaller each time. Plot it and look for the "elbow" — the point where adding another cluster stops buying you much. That's a candidate for k.
For our B2B SaaS company: running the elbow method suggests 4 clusters is where the curve bends. Below 4, customers are lumped together in ways that don't match any recognisable archetype. Above 4, the new segments are too small to act on.
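The computation behind an elbow plot is simple enough to sketch. Here's a toy version on synthetic data (a bare-bones k-means; the blob positions and all numbers are invented): three tight blobs are generated, so within-cluster variance should fall steeply until k = 3 and flatten after.

```python
import numpy as np

def inertia(points, k, iters=50, seed=0):
    """Total within-cluster variance after a basic k-means run."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        # keep a centroid where it is if it loses all its points
        centroids = np.array([points[labels == j].mean(axis=0)
                              if (labels == j).any() else centroids[j]
                              for j in range(k)])
    return float(((points - centroids[labels]) ** 2).sum())

# Three tight synthetic blobs: the curve should bend around k = 3
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(c, 0.3, size=(40, 2))
                   for c in ([0, 0], [8, 0], [4, 7])])
curve = {k: inertia(blobs, k) for k in range(1, 7)}
```

Plotting the values in curve against k produces the elbow plot the playbook at the end of this chapter asks for.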
Interactive — Elbow Method
Each point shows the within-cluster variance for a given k. The curve bends sharply at the elbow — that's where adding more clusters stops meaningfully reducing variance. Click a point to select k.
Other clustering algorithms
K-means is the starting point, but your data team may reach for other algorithms when the data doesn't fit k-means' assumptions. Here's what you need to know:
| Algorithm | Need to choose k? | Cluster shape | Handles noise/outliers? | When to use |
|---|---|---|---|---|
| K-means | Yes | Spherical blobs | No — outliers distort centroids | Default choice; fast, scalable, interpretable |
| DBSCAN | No | Any shape | Yes — labels outliers as noise | Irregular cluster shapes; fraud/anomaly detection; geographic data |
| Hierarchical | No (choose after) | Any shape | Partially | When you want to explore a range of k at once; small-to-medium datasets |
DBSCAN — density-based clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely packed together and labels isolated points as noise. You don't choose k — the algorithm finds however many clusters exist in the data based on two parameters: how close points need to be (epsilon) and how many neighbours define a cluster (min_samples). This makes it powerful for irregular shapes and outlier detection, but sensitive to those two parameters. If your team uses DBSCAN, ask them to show you how results change as epsilon varies.
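A minimal sketch makes epsilon and min_samples concrete. This is a toy implementation on invented points (real work would use sklearn.cluster.DBSCAN): points with enough close neighbours are "core" points, clusters grow outward from them, and anything unreachable is labelled -1 for noise.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Toy DBSCAN: returns a cluster id per point, -1 = noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_samples:
            continue  # already claimed, or not a core point
        labels[i] = cluster          # grow a new cluster from core point i
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_samples:
                    frontier.extend(neighbours[j])  # j is core: keep expanding
        cluster += 1
    return labels

# Two dense groups plus one isolated point that should come back as noise
pts = np.array([[0, 0], [0, 0.2], [0.2, 0], [0.1, 0.1],
                [5, 5], [5, 5.2], [5.2, 5], [5.1, 5.1],
                [20, 20]], dtype=float)
labels = dbscan(pts, eps=0.5, min_samples=3)
```

Re-running with a larger eps merges groups and absorbs the outlier, which is the parameter sensitivity worth asking your team to demonstrate.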
Hierarchical clustering — building a tree
Hierarchical clustering builds a dendrogram — a tree that shows how data points progressively merge into clusters. You don't commit to a k upfront; instead you "cut" the tree at whatever level gives a useful number of clusters. This lets you explore the structure at multiple resolutions. The downside is cost: it scales poorly to large datasets (millions of users), so in practice it's used for smaller segmentation studies or to validate k-means results.
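The merge process can be sketched in a few lines: repeatedly fuse the two closest clusters until k remain, which is equivalent to cutting the tree at that level. This toy single-linkage version on invented points shows why it scales poorly — every step compares every pair of clusters (real work would use scipy.cluster.hierarchy or scikit-learn's AgglomerativeClustering):

```python
import numpy as np

def agglomerative(points, k):
    """Toy single-linkage clustering: merge closest pairs until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    while len(clusters) > k:
        # find the pair of clusters whose closest members are closest
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if link < best_d:
                    best_d, best = link, (a, b)
        a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Three obvious pairs of points; cutting at k=3 should recover them
pts = np.array([[0, 0], [0.2, 0], [5, 5], [5.2, 5],
                [10, 0], [10.2, 0]], dtype=float)
groups = [sorted(c) for c in agglomerative(pts, k=3)]
```

Calling the same function with k = 2 or k = 1 shows the next merges up the tree — that's the "explore multiple resolutions" property in practice.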
PM Insight
For most product segmentation work, k-means is the right default. Push for DBSCAN when you suspect your user base has outliers distorting the segments (e.g., power users or bots pulling centroids) or when cluster shapes are clearly non-spherical. Push for hierarchical clustering when you want to explore the data before committing to a number of segments.
Practical clustering: user segmentation
User segmentation is one of the highest-value applications of clustering in product. Instead of treating all users the same, you identify groups with distinct behaviour patterns — then tailor features, messaging, or interventions to each.
Back to our B2B SaaS company. Running k-means with k = 4 on login frequency, features used, support tickets, and time-to-value surfaces four distinct groups:
- Power users — daily logins, high feature breadth, near-zero support tickets. These are your advocates and expansion candidates.
- Passive observers — signed up, log in occasionally, use one or two features. High churn risk; the product hasn't found its hook yet.
- Newly activated — recent sign-ups with rising engagement. The intervention window is now — onboarding friction here becomes long-term retention.
- At-risk accounts — were once engaged, engagement has fallen. Need proactive customer success outreach before renewal.
None of this came from labels. No one tagged customers as "at-risk" — the algorithm found that group by noticing that some accounts share a pattern of declining engagement. The PM's job is to look at each cluster and ask: what does this tell us about what we should build or do next?
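Operationally, segments become useful when new customers can be routed into them. Here's a hedged sketch of that step — every number, and the four "centroid" rows, are invented for illustration. The practical detail worth noticing is the z-scoring: logins-per-week and days-to-value live on wildly different scales, and without standardising, the biggest raw numbers silently dominate what "similar" means.

```python
import numpy as np

# Hypothetical behavioural features per customer:
# [logins/week, features used, support tickets, days to first value]
customers = np.array([
    [20.0, 14, 0, 3],    # power-user-like
    [1.0,  2,  1, 40],   # passive-observer-like
    [6.0,  5,  0, 7],    # newly-activated-like
    [2.0,  9,  6, 5],    # at-risk-like
])

# z-score each column so every feature contributes comparably to distance
mu, sigma = customers.mean(axis=0), customers.std(axis=0)
scaled = (customers - mu) / sigma

# Suppose these four rows are the centroids a k-means run produced (illustrative)
centroids = scaled

def assign_segment(raw_customer):
    """Scale a new customer the same way, then pick the nearest centroid."""
    z = (np.asarray(raw_customer) - mu) / sigma
    return int(np.linalg.norm(centroids - z, axis=1).argmin())

new_customer = [18.0, 12, 1, 4]   # behaves like a power user
segment = assign_segment(new_customer)
```

The same scaling statistics (mu, sigma) must be reused at assignment time — scaling new customers independently would quietly change the geometry.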
Segmentation vs personas
Clustering produces data-driven segments — groups that actually exist in your user base, defined by measurable behaviour. Personas are qualitative archetypes, often hand-crafted. Both are useful; they answer different questions. Clustering tells you who is here. Personas tell you what they care about. The best product teams use both: segments to find the groups, qualitative interviews to understand what drives them.
Dimensionality reduction: seeing the shape of data
Users generate hundreds of behavioural signals. You can't visualise 200-dimensional data — human brains top out at 3 dimensions. Dimensionality reduction compresses many variables down to 2 or 3 while preserving as much structure as possible, so you can actually plot the data and see what's going on.
The two most common techniques:
- PCA (Principal Component Analysis) — finds the directions where the data varies the most and projects everything onto those axes. Think of it like finding the best angle to photograph a 3D sculpture so you capture the most information in a 2D photo. Fast and interpretable, but linear: it only captures straight-line structure in the data. Good for understanding which features drive the most variation in your user base.
- t-SNE and UMAP — more sophisticated techniques that keep similar data points close together when compressing. Better for visualising clusters but the distances between clusters aren't meaningful, and they're slower to compute. Often used to visualise word or user embeddings (numerical representations of meaning — covered in Ch 10).
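PCA itself is only a few lines of linear algebra: centre the data, then take the highest-variance directions from a singular value decomposition. A sketch on synthetic data — six correlated signals secretly driven by two underlying behaviours, so two components should capture nearly all the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 users x 6 behavioural signals, driven by 2 hidden behaviours plus noise
base = rng.normal(size=(200, 2))              # the two underlying behaviours
mixing = rng.normal(size=(2, 6))              # how they show up in 6 signals
X = base @ mixing + rng.normal(scale=0.1, size=(200, 6))

# PCA: centre the data, then project onto the top-variance directions
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T                     # each user as a 2-D point

explained = (S ** 2) / (S ** 2).sum()         # variance share per component
```

The explained array is worth asking for alongside any PCA plot: if the first two components explain only 30% of the variance, the 2-D picture is hiding most of the story.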
PM Insight
If your data team shows you a 2D scatter plot of users and says "these are your segments," ask what technique they used. t-SNE plots look compelling but can be misleading — cluster sizes and distances between clusters are not interpretable. They show you which users are similar, not how different segments are from each other.
Anomaly detection: when a point fits no cluster
If most users cluster together and a few data points don't fit any cluster, those outliers might be anomalies — fraud, bots, errors, or genuinely unusual behaviour. DBSCAN explicitly labels these points as noise. K-means forces every point into a cluster, but you can detect anomalies by looking for points with unusually high distance to their centroid — they "belong" to a cluster only reluctantly.
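The distance-to-centroid idea fits in a few lines. This sketch uses a single cluster (its centroid is just the mean) and a mean-plus-three-standard-deviations threshold — both simplifying assumptions for illustration, not a production recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal([0, 0], 0.5, size=(200, 2))   # one "normal behaviour" cluster
odd = np.array([[6.0, 6.0], [-7.0, 5.0]])          # two points far from everything
points = np.vstack([normal, odd])

# With one cluster the centroid is the mean; distance to it is the anomaly score
centroid = points.mean(axis=0)
scores = np.linalg.norm(points - centroid, axis=1)

# Flag points whose score is far above typical (illustrative threshold)
threshold = scores.mean() + 3 * scores.std()
flagged = np.flatnonzero(scores > threshold)
```

The threshold is the product decision hiding inside the maths: lower it and you catch more real anomalies at the cost of flagging legitimate power users, which is exactly the trade-off described below.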
Anomaly detection is used heavily in:
- Fraud and abuse — a transaction pattern unlike any normal cluster: wrong geography, wrong amount, wrong timing. The anomaly score flags it before a human reviews it.
- Infrastructure monitoring — server latency or error rates that deviate from the normal pattern for that time of day. Useful for catching incidents before users report them.
- Data quality — records with values so far from any cluster centroid that they're likely corrupted, malformed, or from a different source. Catching these before they reach downstream models prevents silent model degradation.
- Content moderation — posts or accounts whose behaviour pattern sits far from any normal user cluster, flagging potential bot networks or coordinated abuse.
When anomaly detection goes wrong
Anomaly detection fails in two directions. Set the sensitivity too high and you flag everything — legitimate power users look "abnormal" because their behaviour is unusual, not because they're bad actors. Set it too low and real anomalies slip through because the model learned them as normal over time (model drift). Both failure modes require ongoing monitoring. Ask your team: how often do we retrain, and what's our false positive rate in production?
How recommendation systems use unsupervised ideas
Recommendation systems — Netflix, Spotify, Amazon — are partly built on unsupervised ideas, particularly collaborative filtering: users who behave similarly probably share preferences, even for items neither has seen yet.
The core insight: you don't need to label what makes content "similar." You just need to observe that users who watched A also tended to watch B. The pattern is in the co-occurrence (the fact that things appear together), not in any explicit feature of the content itself.
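Co-occurrence counting needs nothing more than the standard library. This sketch (users and titles invented) counts how often pairs of titles appear in the same watch history, then ranks recommendations by those counts:

```python
from collections import Counter
from itertools import combinations

# Watch history per user (hypothetical titles)
history = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "B", "D"},
    "u4": {"C", "D"},
}

# Count how often each pair of titles is watched by the same user
pairs = Counter()
for titles in history.values():
    for a, b in combinations(sorted(titles), 2):
        pairs[(a, b)] += 1

def recommend(title):
    """Rank other titles by how often they co-occur with `title`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == title:
            scores[b] += n
        elif b == title:
            scores[a] += n
    return [t for t, _ in scores.most_common()]
```

Real systems work from millions of users and normalise the scores (otherwise universally popular titles dominate every list), but the co-occurrence idea is the same.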
Modern recommenders combine collaborative filtering with content features (genre, length, creator) and context (time of day, device) in hybrid systems. But the unsupervised intuition — similar users, similar preferences — remains central.
PM Insight
Recommendation quality depends on data density. A new user with no history (the "cold start" problem) gets poor recommendations because there's no behavioural signal to cluster on. Plan for this: what do you show new users before you know anything about them? Onboarding questions, trending content, category selection — these are all cold-start solutions.
What comes next
Unsupervised learning finds patterns. Supervised learning predicts outcomes. But how do you know if any model — supervised or unsupervised — is actually good? That's the question Chapter 9 answers. You'll see how models are evaluated on data they've never seen, why a model that looks great in training can fail in production, and which metrics to demand from your data team before anything ships.
PM Playbook — Questions to ask
- Why this number of clusters? — it's a choice; make sure it's driven by what's actionable, not what looks cleanest mathematically
- Show me the elbow plot. — ask to see within-cluster variance across a range of k values before committing to one
- What features were used to cluster? — features define what "similar" means; clustering on spend alone produces spend-based segments, not behaviour-based ones
- Did we check for outliers before clustering? — extreme users can pull k-means centroids away from the real cluster centres
- Are segments stable over time? — do users move between segments? How fast? Unstable segments are hard to act on
- What's the cold start plan? — for recommendations or personalisation, how do you handle new users with no behavioural history?
- How are we validating segments? — statistical validity (low within-cluster variance) ≠ business validity. Qualitative interviews with users in each segment are the most reliable check
- Are we using segments to inform decisions, or just to describe? — segmentation is only valuable if it changes what you build, who you target, or how you communicate
- For anomaly detection: what's our false positive rate? — a system that flags 30% of legitimate users as anomalies is as harmful as one that misses real anomalies