Chapter 08

Unsupervised Learning

Clustering, user segmentation, and how ML finds structure in data with no labels to learn from.

⏱ 20 min read 📊 Includes 2 interactives

Learning without answers

Supervised learning needs labeled data — examples with known answers. But what if you don't have labels? What if you just have a large pile of user behaviour data and want to know: are there natural groups in here?

That's unsupervised learning. No labels, no right answers — just algorithms that find structure in raw data. The two most useful flavours for product work are clustering (grouping similar things) and dimensionality reduction (compressing many variables into fewer, meaningful ones).

To make this concrete, we'll follow one scenario throughout the chapter. Imagine you're a PM at a B2B SaaS company. You have 4,000 customers and a dashboard full of behavioural signals: login frequency, features used, team size, support tickets opened, time-to-value. You suspect your customers fall into distinct archetypes — but nobody has labeled them. Unsupervised learning is how you find out.

PM Insight

Unsupervised learning is where a lot of exploratory product analytics lives. "Who are our users?" comes before "what do our users want?" — and it's an unsupervised question. You're discovering structure, not predicting outcomes. The answers shape roadmap priorities, messaging, and which users you talk to next.


Clustering: finding natural groups

Clustering algorithms group data points so that things within a group are more similar to each other than to things in other groups. The algorithm defines "similar" mathematically — usually by distance in feature space.

K-means is the most common clustering algorithm. It works like this:

  1. You choose k — the number of clusters you want
  2. The algorithm places k centroids (cluster centres) randomly
  3. Each data point is assigned to its nearest centroid
  4. Each centroid moves to the average position of its assigned points
  5. Steps 3–4 repeat until assignments stop changing
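The loop above can be sketched in a few lines of numpy. This is a toy implementation on synthetic 2D data, not how production libraries implement it (they add smarter initialisation and empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: place k centroids (here: at k randomly chosen data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (a production implementation would also handle empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (and hence assignments) settle
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Three synthetic blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
```

Note that the result depends on where the centroids start, which is why libraries run the whole loop several times from different random initialisations and keep the best result.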

Interactive — K-Means Clustering

Each large marker is a cluster centroid. Coloured dots are data points assigned to the nearest centroid. Step through iterations manually or hit Run to animate convergence.



What k-means can't do

K-means has real limitations that matter for product decisions:

  1. You must choose k upfront; the algorithm won't tell you how many groups actually exist
  2. It assumes roughly spherical, similar-sized clusters, so irregular shapes get split or merged badly
  3. Outliers distort centroids: a handful of extreme accounts can drag a cluster centre toward them
  4. Results depend on the random initial centroid placement, so different runs can produce different segments

PM Insight

When your data team says "we ran clustering and found 5 user segments," ask: Why 5? What happens at 3 or 8? The number of clusters is a choice, not a discovery. Make sure it's driven by what's actionable, not what's mathematically convenient.

Picking k: the elbow method

One way to inform the choice of k is the elbow method. Run k-means for k = 1, 2, 3 … and measure the total within-cluster variance at each step — how spread out points are inside their assigned cluster. As k increases, variance always falls (more clusters = tighter fits). But the improvement gets smaller each time. Plot it and look for the "elbow" — the point where adding another cluster stops buying you much. That's a candidate for k.

For our B2B SaaS company: running the elbow method suggests 4 clusters is where the curve bends. Below 4, customers are lumped together in ways that don't match any recognisable archetype. Above 4, the new segments are too small to act on.
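The elbow computation itself is a short loop. A sketch assuming scikit-learn is available, with synthetic data in which four planted blobs stand in for the four archetypes:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in: four behavioural blobs playing the role of four archetypes
X = np.vstack([rng.normal(c, 0.6, size=(100, 2))
               for c in ([0, 0], [6, 0], [0, 6], [6, 6])])

# Total within-cluster variance (scikit-learn calls it inertia_) for each candidate k
inertias = {}
for k in range(1, 9):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# Variance always falls as k grows; the elbow is where the drop flattens out
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
```

Plotting `inertias` against k reproduces the elbow chart: with four planted blobs, the drop from k = 3 to k = 4 is large and the drops after that are small.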

Interactive — Elbow Method

Each point shows the within-cluster variance for a given k. The curve bends sharply at the elbow — that's where adding more clusters stops meaningfully reducing variance. Click a point to select k.


Other clustering algorithms

K-means is the starting point, but your data team may reach for other algorithms when the data doesn't fit k-means' assumptions. Here's what you need to know:

| Algorithm    | Need to choose k? | Cluster shape   | Handles noise/outliers?         | When to use                                                        |
|--------------|-------------------|-----------------|---------------------------------|--------------------------------------------------------------------|
| K-means      | Yes               | Spherical blobs | No — outliers distort centroids | Default choice; fast, scalable, interpretable                      |
| DBSCAN       | No                | Any shape       | Yes — labels outliers as noise  | Irregular cluster shapes; fraud/anomaly detection; geographic data |
| Hierarchical | No (choose after) | Any shape       | Partially                       | When you want to explore a range of k at once; small-to-medium datasets |

DBSCAN — density-based clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely packed together and labels isolated points as noise. You don't choose k — the algorithm finds however many clusters exist in the data based on two parameters: how close points need to be (epsilon) and how many neighbours define a cluster (min_samples). This makes it powerful for irregular shapes and outlier detection, but sensitive to those two parameters. If your team uses DBSCAN, ask them to show you how results change as epsilon varies.
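A minimal DBSCAN sketch with scikit-learn. The `eps` and `min_samples` values here are illustrative choices for this toy data, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense groups of "users" plus two isolated outliers
dense = np.vstack([rng.normal([0, 0], 0.3, size=(60, 2)),
                   rng.normal([5, 5], 0.3, size=(60, 2))])
outliers = np.array([[10.0, -5.0], [-8.0, 9.0]])
X = np.vstack([dense, outliers])

# eps = "how close points need to be"; min_samples = "how many neighbours define a cluster"
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# DBSCAN labels noise as -1; the number of clusters is discovered, not chosen
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Here the algorithm finds the two dense groups on its own and marks the two isolated points as noise (label -1), which is exactly the behaviour that makes it useful for outlier detection.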

Hierarchical clustering — building a tree

Hierarchical clustering builds a dendrogram — a tree that shows how data points progressively merge into clusters. You don't commit to a k upfront; instead you "cut" the tree at whatever level gives a useful number of clusters. This lets you explore the structure at multiple resolutions. The downside is cost: it scales poorly to large datasets (millions of users), so in practice it's used for smaller segmentation studies or to validate k-means results.
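A sketch of the "cut the tree at different levels" idea, assuming SciPy is available. The dataset and the linkage choice (Ward) are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in ([0, 0], [4, 0], [2, 5])])

# Build the full merge tree once (Ward linkage merges the pair of clusters
# whose merge increases within-cluster variance the least)
Z = linkage(X, method="ward")

# "Cut" the same tree at different levels, no re-fitting needed
labels_k2 = fcluster(Z, t=2, criterion="maxclust")
labels_k3 = fcluster(Z, t=3, criterion="maxclust")
```

The expensive step (building the tree) happens once; exploring 2, 3, or 10 clusters afterwards is nearly free. That is the practical appeal, and the poor scaling of that first step is the practical limit.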

PM Insight

For most product segmentation work, k-means is the right default. Push for DBSCAN when you suspect your user base has outliers distorting the segments (e.g., power users or bots pulling centroids) or when cluster shapes are clearly non-spherical. Push for hierarchical clustering when you want to explore the data before committing to a number of segments.


Practical clustering: user segmentation

User segmentation is one of the highest-value applications of clustering in product. Instead of treating all users the same, you identify groups with distinct behaviour patterns — then tailor features, messaging, or interventions to each.

Back to our B2B SaaS company. Running k-means with k = 4 on login frequency, features used, support tickets, and time-to-value surfaces four distinct groups, each with its own behavioural signature, including one cluster of accounts whose engagement is steadily declining.

None of this came from labels. No one tagged customers as "at-risk" — the algorithm found that group by noticing that some accounts share a pattern of declining engagement. The PM's job is to look at each cluster and ask: what does this tell us about what we should build or do next?
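The end-to-end workflow is short. A sketch assuming pandas and scikit-learn; the feature names mirror the scenario but the data is synthetic, so the segments it finds here are arbitrary — the point is the scale-cluster-profile pattern:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical behavioural signals for 4,000 customers (synthetic stand-in data)
df = pd.DataFrame({
    "logins_per_week": rng.gamma(2.0, 3.0, size=4000),
    "features_used": rng.integers(1, 20, size=4000),
    "support_tickets": rng.poisson(1.5, size=4000),
    "days_to_value": rng.gamma(3.0, 5.0, size=4000),
})

# Scale first: otherwise the feature with the largest numeric range dominates distance
X = StandardScaler().fit_transform(df)
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Profile each segment: these averages are what you interpret into archetypes
profile = df.groupby("segment").mean().round(1)
```

The profile table, not the raw cluster labels, is where the PM work starts: each row is a candidate archetype to name, validate against qualitative research, and act on.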

Segmentation vs personas

Clustering produces data-driven segments — groups that actually exist in your user base, defined by measurable behaviour. Personas are qualitative archetypes, often hand-crafted. Both are useful; they answer different questions. Clustering tells you who is here. Personas tell you what they care about. The best product teams use both: segments to find the groups, qualitative interviews to understand what drives them.


Dimensionality reduction: seeing the shape of data

Users generate hundreds of behavioural signals. You can't visualise 200-dimensional data — human brains top out at 3 dimensions. Dimensionality reduction compresses many variables into 2 or 3 while preserving as much structure as possible, so you can actually plot the data and see what's going on.

The two most common techniques:

  1. PCA (principal component analysis): linear, fast, and deterministic. It projects the data onto the directions of greatest variance, and those axes can often be interpreted.
  2. t-SNE: nonlinear. It preserves local neighbourhoods, so similar users end up near each other in the 2D plot. Compelling for visual exploration, but distances and cluster sizes in the plot are not meaningful.

PM Insight

If your data team shows you a 2D scatter plot of users and says "these are your segments," ask what technique they used. t-SNE plots look compelling but can be misleading — cluster sizes and distances between clusters are not interpretable. They show you which users are similar, not how different segments are from each other.
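One of the standard, more interpretable techniques here is PCA (principal component analysis). A minimal sketch assuming scikit-learn: 200 simulated signals that secretly derive from just 3 underlying behaviours compress down to 2 dimensions with most of the structure intact:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 simulated signals per user that secretly derive from 3 underlying behaviours
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 200)) + rng.normal(scale=0.1, size=(500, 200))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # 500 users, now plottable in 2 dimensions

# Fraction of the original variation the 2D picture preserves
explained = pca.explained_variance_ratio_.sum()
```

`explained_variance_ratio_` is the number to ask for: it tells you how much of the original variation the 2D picture actually preserves, which is exactly the honesty check a t-SNE plot can't give you.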


Anomaly detection: when a point fits no cluster

If most users cluster together and a few data points don't fit any cluster, those outliers might be anomalies — fraud, bots, errors, or genuinely unusual behaviour. DBSCAN explicitly labels these points as noise. K-means forces every point into a cluster, but you can detect anomalies by looking for points with unusually high distance to their centroid — they "belong" to a cluster only reluctantly.
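The distance-to-centroid idea can be sketched like this. A single cluster keeps the sketch simple, and the "bot" points and the three-standard-deviations threshold are illustrative choices, not a recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = rng.normal([0, 0], 1.0, size=(300, 2))
bots = np.array([[9.0, 9.0], [10.0, 8.5]])  # hypothetical bot-like behaviour
X = np.vstack([normal, bots])

# One cluster keeps the sketch simple; with k > 1 use each point's own centroid
model = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
dists = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

# Flag points far beyond the typical distance as candidate anomalies
threshold = dists.mean() + 3 * dists.std()
anomalies = np.where(dists > threshold)[0]
```

The threshold is the sensitivity dial discussed below: tighten it and you flag legitimate power users; loosen it and real anomalies slip through.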

Anomaly detection is used heavily in:

  1. Fraud detection: transactions that fit no normal spending pattern
  2. Bot and abuse detection: accounts whose behaviour looks nothing like real users
  3. Operational monitoring: error spikes and metric deviations that signal something broke

When anomaly detection goes wrong

Anomaly detection fails in two directions. Set the sensitivity too high and you flag everything — legitimate power users look "abnormal" because their behaviour is unusual, not because they're bad actors. Set it too low and real anomalies slip through because the model learned them as normal over time (model drift). Both failure modes require ongoing monitoring. Ask your team: how often do we retrain, and what's our false positive rate in production?


How recommendation systems use unsupervised ideas

Recommendation systems — Netflix, Spotify, Amazon — are partly built on unsupervised ideas, particularly collaborative filtering: users who behave similarly probably share preferences, even for items neither has seen yet.

The core insight: you don't need to label what makes content "similar." You just need to observe that users who watched A also tended to watch B. The pattern is in the co-occurrence (the fact that things appear together), not in any explicit feature of the content itself.
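The co-occurrence idea in miniature: a toy item-item recommender over a hand-built watch matrix. Everything here (the data, the `recommend` helper) is hypothetical, but the mechanism is the real one:

```python
import numpy as np

# Rows = users, columns = items; 1 means "watched" (hand-built toy data)
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],  # user 2 watched item 0 but not (yet) item 1
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Item-item co-occurrence: how often each pair of items is watched together
co = R.T @ R
np.fill_diagonal(co, 0)

def recommend(user, R, co):
    # Score unseen items by how often they co-occur with what the user watched
    scores = R[user] @ co
    scores[R[user] == 1] = -1  # never re-recommend what they've already seen
    return int(scores.argmax())
```

For user 2, who watched item 0, the top recommendation is item 1: the item that most often appears alongside item 0 in other users' histories. No content features were involved, only co-occurrence.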

Modern recommenders combine collaborative filtering with content features (genre, length, creator) and context (time of day, device) in hybrid systems. But the unsupervised intuition — similar users, similar preferences — remains central.

PM Insight

Recommendation quality depends on data density. A new user with no history (the "cold start" problem) gets poor recommendations because there's no behavioural signal to cluster on. Plan for this: what do you show new users before you know anything about them? Onboarding questions, trending content, category selection — these are all cold-start solutions.


What comes next

Unsupervised learning finds patterns. Supervised learning predicts outcomes. But how do you know if any model — supervised or unsupervised — is actually good? That's the question Chapter 9 answers. You'll see how models are evaluated on data they've never seen, why a model that looks great in training can fail in production, and which metrics to demand from your data team before anything ships.


PM Playbook — Questions to ask


4 questions

  1. Why this number of segments? What happens at one fewer or a few more, and is the choice driven by what's actionable?
  2. How sensitive are the clusters to the parameters (k, epsilon) and to random initialisation? Show me how the results change.
  3. Which technique produced that 2D plot of users, and which aspects of it are actually interpretable?
  4. For anomaly detection: how often do we retrain, and what's the false positive rate in production?