Chapter 13

Working with Data Teams

How to scope ML projects, set up your data scientists for success, spot what's truly feasible, and avoid the failure modes nobody warns you about.

⏱ 15 min read 🏁 Final chapter

The PM–data team dynamic

The relationship between product managers and data scientists is one of the most consequential — and most commonly strained — partnerships in a tech company. PMs push for faster timelines and clear business outcomes. Data scientists deal with messy data, uncertain results, and work that can't always be planned in two-week sprints.

Most of the friction comes not from personality clashes but from structural mismatches: different vocabularies, different definitions of "done," and different intuitions about what's hard. This chapter is about closing that gap — from the PM side.

PM Insight

The single most valuable thing you can do as a PM working with data teams is show up with a well-scoped problem, not a pre-specified solution. "We need to increase retention by 5%" is a better brief than "we need a churn prediction model." Let data scientists propose the approach — they'll know what's feasible and what's not.

What "feasible" actually means in ML

"Can we build a model to predict X?" is asked in almost every product planning cycle. The answer depends on four things that are rarely checked upfront:

✓ The signal exists

Does the input data actually predict the outcome?
If the relationship is random or weak, no model will find it
Domain expertise (yours) is the first filter here

✓ The data exists and is clean

Is the historical data actually logged and accessible?
Are the labels defined and reliable?
How much of the data is missing, corrupt, or inconsistent?

✗ Common blockers

The outcome isn't logged (you can't train on what you didn't measure)
Historical data reflects old behaviour, not current user base
Labels are ambiguous or inconsistently applied

✗ More common blockers

The business case doesn't cover the engineering cost
Latency requirements rule out the models that would work
Regulation or privacy constraints limit available features

PM Insight

Run a "data availability audit" before committing to any ML project. Spend an hour with a data engineer asking: is this outcome logged? How far back does the data go? What's the label quality like? How many examples do we have? This conversation will save weeks of work if the answer to any of these is "we don't know."

The ML project lifecycle

ML projects have a fundamentally different lifecycle from software features. The core difference: in software, you know when you're done. In ML, you know what you're optimising for — but you don't know in advance what performance you'll achieve, or how long it will take.

Problem framing

Define the outcome to predict, the label, the business metric, and the acceptance criteria. This is where PMs add the most value — and where the most expensive mistakes are made.

Data exploration

Understand the available data: distributions, missing values, label quality, class balance. Often reveals that the original framing needs to change. Expect this phase to surface surprises.

Baseline

Build the simplest possible model — a heuristic, a rule, or a logistic regression. This sets the floor. Everything else is measured against it. A strong baseline is a sign of a mature team.

Modelling & iteration

Feature engineering, model selection, hyperparameter tuning, error analysis. Iterative and often non-linear. Teams can spin here for weeks if acceptance criteria aren't clear.

Evaluation & sign-off

Hold-out test set evaluation, business case validation, bias audit, stakeholder review. Don't skip the bias audit. The discomfort is the point — you're looking for behaviours you don't want to ship.

Production & integration

Feature pipelines, serving infrastructure, A/B test setup, logging. Often takes as long as the modelling phase and is chronically underscoped in project plans.

Monitoring & maintenance

Ongoing. Model performance tracking, drift detection (automatically watching for signs that inputs or outputs are shifting away from what the model was trained on), and retraining triggers. The project isn't over at launch — it's moved into operations. Plan for this before you start.

Failure modes nobody warns you about

ML projects fail differently from software projects. The code compiles. The model trains. The metrics look fine. And yet.

🎯

Optimising the wrong metric

The model maximises what it's measured on, not what the business needs. A content recommendation model optimised for clicks learns to recommend outrage. Align the training objective to actual business value, not a convenient proxy.

🔒

The model works but nothing changes

The model is accurate, the A/B test is positive, but downstream teams don't act on its outputs. ML value is delivered through decisions and actions — not through model accuracy alone. Map the full action loop before investing in the model.

🌊

The training data is the product of the current system

A model trained on "what users clicked on" learns to replicate the decisions of the current ranking system, including its biases. If you want to escape the current system's logic, you need data that isn't contaminated by it.

🧪

Can't experiment because the model is the product

When the ML model is the core feature, A/B testing it is hard — the control is the old model, which means running two serving stacks. Teams skip experiments to save infrastructure cost and lose the ability to measure impact.

📅

Timeline anchored on modelling, not deployment

"The model is done" ≠ shipped. Feature pipelines, serving infra, monitoring, shadow testing, and A/B ramp-up routinely double the time from "model complete" to "in production." Scope it all at kickoff.

📉

Silent degradation

The model degrades gradually as data distributions shift. Without monitoring, the first sign of failure is a user complaint or a business metric review — weeks or months after the model stopped performing well.

Ownership and handoff: DS, MLE, and SWE

Most stalled ML projects aren't stuck on a technical problem. They're stuck at an ownership boundary — where one team thinks they've handed off and the other team hasn't picked up. PMs feel this as "the model is done but nothing is happening." Learning to diagnose it is one of the highest-leverage skills you can develop.

The three roles (and what they actually own)

Role names vary by company — some organisations fold ML engineering into data science or into platform infrastructure. What matters is that the work has to be done somewhere. Map responsibilities, not job titles.

Data Scientist (DS)

Problem framing and data understanding
Feature selection and model choice
Offline evaluation and error analysis
Prototype in a notebook or training script

Deliverable: a model that hits offline criteria

ML Engineer (MLE)

Reproducible training pipelines
Feature pipelines and model serving
Model registry and versioning
Monitoring and retraining infrastructure

Deliverable: a model that runs reliably at scale

Software Engineer (SWE)

API contracts and product integration
Fallback behaviour and error states
A/B test wiring and experiment logging
Latency budgets at the application layer

Deliverable: a working product feature

Where handoffs happen — and break

The four critical handoff points map directly onto the project lifecycle. Each is a place where work can silently stop moving.

Problem framing → Data (PM/DS to Data Engineering)

Is the outcome you want to predict actually logged, reliably, in a table you can train on? This handoff breaks when product engineers own the logging and don't know an ML team is depending on it — so schema changes or gaps go unannounced.

Notebook → Training pipeline (DS to MLE)

A working prototype and a production training pipeline are different jobs. Turning a notebook into a reproducible, scheduled, monitored pipeline requires MLE skills and MLE capacity. Breaks when DS declares "the model is done" before MLE has been staffed or briefed.

Model → Serving (MLE to SWE)

Who wraps the model in an API? Who sets the timeout and the fallback? Who logs predictions to the data warehouse for future retraining? Breaks when each side assumes the other is handling it — discovered at integration week, two days before launch.

Launch → Operations (everyone to nobody)

Who gets paged when the model degrades? Who decides to retrain, and who executes it? Drift dashboards exist; on-call rosters often don't. This handoff breaks by default because monitoring ownership is almost never assigned at kickoff — only after something goes wrong.

Why handoffs go wrong

🧩

Throw it over the wall

DS finishes the model and moves on before MLE has picked it up. The preprocessing steps that made the model work live only in the data scientist's head — or in an undocumented notebook. Reproducing them in a production pipeline takes weeks and often reintroduces bugs.

🪞

Training/serving skew

Features are computed one way during training and a different way in serving — because DS and MLE built the logic independently. The model looked great offline and underperforms in production. This is one of the most common causes of the offline/online gap, and it's entirely an ownership problem.

🕳️

No one owns the feature pipeline end-to-end

The model depends on upstream data produced by another team. When that team changes a schema, deprecates a field, or introduces latency, the model silently breaks. Because no single team owns the full chain, no single team gets notified — and no one notices until the model's outputs stop making sense.

📟

Monitoring is everyone's job, therefore no one's

Drift dashboards exist. On-call rotations don't. Three teams each assume one of the others is watching the model health metrics. The first signal of degradation is a support ticket or a quarterly business review — by which point the model has been underperforming for months.

PM Insight

The most dangerous phrase in an ML project is "the model is done." Done by whom? Handed to whom? To do what next? If the answer is vague, your launch date is wrong. At kickoff, draw the full ownership map — not just who builds the model, but who owns the feature pipeline, the serving layer, the monitoring, the retraining trigger, and the rollback decision. If any box is empty, that's a launch risk, not a detail to sort out later.

How to be a great PM for a data team

Before the project starts

Define success before anyone writes code. What offline metric threshold would make you confident enough to run an A/B test? What business metric needs to move, by how much, in what timeframe?
Audit data availability first. Don't commit to a timeline until you know what data exists and what quality it's in.
Baseline before ML. What's the simplest non-ML solution? How well does it perform? This sets the bar and often reveals that ML isn't necessary.

During the project

Protect exploration time. Data scientists need unstructured time to understand the data. Scrum ceremonies that treat exploration as a deliverable kill this.
Review error analysis, not just metrics. Ask to see examples of what the model gets wrong. Patterns in errors reveal more than aggregate metrics and often drive the highest-leverage improvements.
Be the domain expert. Data scientists know the algorithms. You know the users and the business logic. Show up with that knowledge — it shapes which features matter and which errors are most costly.

At launch and beyond

Define the monitoring plan before launch. Not after. Who checks what, how often, and what triggers a rollback?
Schedule a retrospective at 30 and 90 days. Did the offline metrics predict online impact? What would you do differently? This builds institutional knowledge for the next ML project.
Celebrate experiments that fail cleanly. A well-run negative result is valuable. A team that kills bad ideas quickly is healthier than one that only ships winners.

PM Insight

The best PM–data scientist partnerships are built on mutual respect for each other's domain. Data scientists aren't vending machines for models — they're researchers who need the right problem, the right data, and the right success criteria to do their best work. Give them those, and get out of the way.

The vocabulary bridge

Miscommunication between PMs and data teams often comes from the same words meaning different things. A short reference:

When they say "the model is accurate"

Ask: accurate on what metric? On what dataset? On which subgroups? At what threshold? "Accurate" has no meaning without context.

When they say "we need more data"

Ask: more data of what type? More labels, more features, or more volume of what we already have? Each has a different solution.

When they say "the model is ready"

Ask: ready for what? Ready for offline evaluation is different from ready for shadow mode, which is different from ready for production. Clarify the stage.

When they say "this is a hard problem"

Ask: hard because the signal doesn't exist, hard because we lack data, hard because of latency constraints, or hard because it's novel research? Each is a different kind of hard.

PM Playbook — Questions to ask at every stage

At project kickoff

What's the simplest baseline we can build first?
Is the training outcome actually logged in our data warehouse?
What are the acceptance criteria — and are they agreed before we start?
What's the full timeline including deployment, not just modelling?
Who owns the feature pipeline, the serving layer, and the monitoring — by name, not by team?
Which handoffs does this project depend on, and has each receiving team committed capacity?

During development

Can I see examples of what the model gets wrong?
Is performance consistent across user segments?
What would cause us to stop this project? — name the kill criteria upfront
Is feature computation shared between training and serving, or implemented separately by two teams?

Before launch

What's the monitoring plan and who owns it?
Who gets paged when the model degrades, and what's their runbook?
What triggers a rollback — and who makes that call?
Have we run a bias audit?
How does the model degrade gracefully if something breaks?

Post-launch

Did offline metrics predict online impact?
What's the retraining cadence?
What did we learn that should change how we scope the next ML project?

You've finished the guide

You now have the vocabulary and the questions. The gap between PMs who ship good AI products and ones who don't isn't technical depth — it's whether they ask the right questions or wait to be told the answers.

The field moves fast. The specific models, tools, and benchmarks will change. But the fundamentals — distributions, causation, experimental design, how learning works, how to evaluate models, how to build with AI thoughtfully — those stay stable. That's what this guide built.

Where to go next

The best way to deepen this knowledge is to apply it in a real project. Find one ML initiative at your company and ask to be more involved in the technical conversations. Bring the questions from this guide. You'll learn more in a month of engaged practice than in any course.

Check your understanding 5 questions