Chapter 14
Data Infrastructure for PMs
How user actions become data — and why "we have that data" is almost always the wrong assumption to start from.
The question that derails ML projects
A PM proposes a churn prediction model. The data scientist asks: "Do we have historical labels — users who churned, with their behavioral data from the 90 days before they left?" The PM says "yes, we have everything in our database." Three weeks later, the DS team reports that the data doesn't exist in a usable form: it's siloed across three systems, inconsistently logged, and missing the most predictive signals entirely.
This is the most common cause of ML project delays — and it's entirely avoidable if PMs understand how data actually gets from a user's action to a model's training set.
PM Insight
"We have that data" should always be followed by: "in what form? how complete? with what latency? queryable by whom? and does it have the labels we need?" Each of those questions can change the answer from "yes" to "no, but we could build it in 6 weeks."
How data flows: from click to query
A user clicks a button in your app. Here's what has to happen before that click appears in an analyst's query or a model's training data:
Instrumentation — the click fires an event
Your frontend or backend code emits a structured event. This is tracking code that someone explicitly wrote. If nobody instrumented the click, no data is collected — regardless of what your database contains. This is the most common source of "we don't have that data."
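Instrumentation is ordinary application code that someone has to write. A minimal sketch of what it looks like, assuming a hypothetical `track()` helper (a real app would typically call an analytics SDK such as Segment or Amplitude here):

```python
import json
import time
import uuid

def track(event_name, user_id, session_id, platform, **properties):
    """Build a structured analytics event. `track` is a hypothetical
    helper; in production the payload would be sent to an event queue."""
    event = {
        "event_name": event_name,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()) + "Z",
        "event_id": str(uuid.uuid4()),   # unique ID for deduplication downstream
        "user_id": user_id,
        "session_id": session_id,
        "platform": platform,
        "properties": properties,
    }
    return json.dumps(event)

# The click handler explicitly emits the event. If this line is never
# written, the click is never recorded anywhere, whatever the database holds.
payload = track("cart_item_added", "usr_8f2a91c", "ses_7b3d44e", "ios",
                item_id="prod_1842", price_usd=89.99)
```

The point to notice: every field in the payload is a deliberate decision made at build time.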
Event queue — the event is sent to a stream
The event goes to a message queue (Kafka, Kinesis, Pub/Sub). This is a high-throughput buffer that decouples the client from the storage system. Events can arrive out of order or with delays — which affects timestamp accuracy in your data.
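A toy illustration of the ordering problem, with invented event names: a mobile client that was briefly offline flushes old events late, so arrival order on the queue does not match the order of user actions, and downstream code has to sort by the client-side timestamp:

```python
# Events in the order they arrived on the queue. The cart_item_added
# event was buffered on-device and flushed late.
arrived = [
    {"event_name": "checkout",        "timestamp": "2026-04-23T14:35:00Z"},
    {"event_name": "cart_item_added", "timestamp": "2026-04-23T14:32:11Z"},
    {"event_name": "page_view",       "timestamp": "2026-04-23T14:31:05Z"},
]

# Pipelines must sort by the client-side event timestamp, not arrival
# time, to reconstruct the true sequence of actions. (ISO-8601 strings
# in the same timezone sort correctly as plain strings.)
ordered = sorted(arrived, key=lambda e: e["timestamp"])
```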
Ingestion — data lands in a raw store
Events are written to a raw data lake (S3, GCS, Azure Blob). This is the immutable record — everything arrives here, often as JSON blobs. It's comprehensive but not queryable in any useful way yet.
Transformation — ETL/ELT pipelines structure the data
Pipelines (dbt, Airflow, Spark) read the raw events and transform them into structured tables: user activity tables, session tables, cohort tables. This is where business logic lives — how "active" is defined, how sessions are counted, how attribution works. These definitions can be wrong.
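A toy Python version of what such a transformation does. The event names and the definition of "active" are illustrative, but they show where the business logic actually lives:

```python
from collections import defaultdict

raw_events = [
    {"user_id": "u1", "event_name": "page_view",       "date": "2026-04-23"},
    {"user_id": "u1", "event_name": "cart_item_added", "date": "2026-04-23"},
    {"user_id": "u2", "event_name": "page_view",       "date": "2026-04-23"},
]

# Business logic lives here: "active" means the user fired at least one
# qualifying event that day. Change this set and the DAU metric changes.
QUALIFYING_EVENTS = {"cart_item_added", "checkout", "search"}

def daily_active_users(events):
    active = defaultdict(set)
    for e in events:
        if e["event_name"] in QUALIFYING_EVENTS:
            active[e["date"]].add(e["user_id"])
    return {day: len(users) for day, users in active.items()}

dau = daily_active_users(raw_events)  # u2 only viewed a page, so DAU is 1
```

If the definition in `QUALIFYING_EVENTS` doesn't match what the PM meant by "active", every dashboard built on this table is quietly wrong.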
Data warehouse — analysts query structured tables
Transformed data lands in a warehouse (BigQuery, Snowflake, Redshift, Databricks). This is what your BI tools and analysts query. The latency from step 1 to queryable data is typically 1–24 hours for batch pipelines, near-real-time for streaming architectures.
Feature store — ML teams compute model inputs
For ML specifically, a feature store (Feast, Tecton, Vertex AI Feature Store) precomputes and stores the exact inputs a model needs. This ensures training and serving use identical features — avoiding one of the most dangerous ML failure modes (training-serving skew).
What an event actually looks like
Events are the atomic unit of behavioral data. Every click, view, submission, or error your app generates can be logged as an event. A well-structured event looks something like this:
{
  "event_name": "cart_item_added",
  "timestamp": "2026-04-23T14:32:11.042Z",
  "user_id": "usr_8f2a91c",
  "session_id": "ses_7b3d44e",
  "platform": "ios",
  "properties": {
    "item_id": "prod_1842",
    "item_category": "electronics",
    "price_usd": 89.99,
    "quantity": 1,
    "source_page": "search_results",
    "experiment_variant": "B"
  }
}
Notice what's in there: who, when, where (platform, page), what (item details), and experiment context. If any of these aren't logged at event time, you can't reconstruct them later. This is why instrumentation decisions made at build time constrain analysis options for years.
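Teams often enforce this with a schema check at emit time, so a missing property fails in development rather than silently shipping. A minimal sketch, with an invented required-fields registry:

```python
# Hypothetical schema registry: required top-level fields for every
# event, plus per-event required properties.
REQUIRED_TOP_LEVEL = {"event_name", "timestamp", "user_id", "session_id", "platform"}
REQUIRED_PROPS = {
    "cart_item_added": {"item_id", "price_usd", "source_page"},
}

def validate_event(event):
    """Return a sorted list of missing fields; empty means well-formed."""
    missing = REQUIRED_TOP_LEVEL - event.keys()
    props_needed = REQUIRED_PROPS.get(event.get("event_name"), set())
    present_props = event.get("properties", {}).keys()
    missing |= {f"properties.{p}" for p in props_needed - present_props}
    return sorted(missing)

# This event forgot price_usd and source_page -- caught before launch,
# not discovered months later when the analysis needs them.
missing = validate_event({
    "event_name": "cart_item_added", "timestamp": "2026-04-23T14:32:11Z",
    "user_id": "u1", "session_id": "s1", "platform": "ios",
    "properties": {"item_id": "prod_1842"},
})
```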
PM Insight
Whenever you're planning a new feature, ask your data team: "What events do we need to log, and what properties do we need on each?" Do this in the design phase, not after shipping. Adding instrumentation post-launch means you lose all historical data for that behavior.
The data warehouse: what it is and why it matters to you
A data warehouse is a database optimized for analytical queries — reading and aggregating large volumes of historical data — rather than for transactional operations (reading/writing single records fast).
Your production database (Postgres, MySQL, DynamoDB) is built for the app: fast individual reads and writes, single records at a time. It would fall over if your entire analytics team ran queries on it. The warehouse is a separate, read-optimized copy of your data, built for questions like "how many users who signed up in January were still active in April, broken down by acquisition channel?"
Production DB (app)
- Optimized for single-record reads/writes
- Millisecond response times
- Limited history (performance)
- Powers live user experience
- Not for analytics queries
Data warehouse (analytics)
- Optimized for large scans and aggregations
- Seconds to minutes for big queries
- Full history, years of data
- Powers BI, dashboards, ML training
- Not for live app queries
Common warehouses you'll hear about: BigQuery (Google), Snowflake, Redshift (AWS), Databricks. They're mostly interchangeable from a PM's perspective — what matters is understanding that they exist separately from your production system and have their own latency, freshness, and access constraints.
Feature stores: the ML-specific problem
For ML models, there's an additional layer most PMs don't know exists: the feature store. Understanding it will help you ask much better questions about ML project timelines.
A feature store solves a specific problem: when you train a model, you compute features from historical data (e.g. "number of purchases in the last 30 days"). When you serve the model in production, you need to compute the same features in real-time for each prediction. If those computations are done differently — different code paths, different definitions, different windows — the model sees input distributions in production that look nothing like training. It fails silently.
Training-serving skew — the silent failure mode
The model is trained on features computed one way. In production, the same features are computed slightly differently (different timezone handling, a join that excludes some records, a field that means something subtly different). The model was never tested on what it actually receives. Accuracy degrades, nobody knows why. This is why feature stores exist: one computation, used everywhere.
When your DS team says "we need to build out the feature store before we can ship this model," this is what they mean. It's not stalling — it's the engineering foundation that makes the model reliable.
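One way to see why a single code path matters: in this sketch, one function (the hypothetical `purchases_last_30_days`) computes the feature for both historical training examples and live serving, so the definition cannot drift between the two:

```python
from datetime import datetime, timedelta, timezone

def purchases_last_30_days(purchase_timestamps, as_of):
    """One definition, used by BOTH the training pipeline and the
    serving path. This is the guarantee a feature store enforces."""
    cutoff = as_of - timedelta(days=30)
    return sum(1 for ts in purchase_timestamps if cutoff < ts <= as_of)

now = datetime(2026, 4, 23, tzinfo=timezone.utc)
purchase_history = [now - timedelta(days=d) for d in (2, 10, 45)]

# Training computes the feature as of a historical date; serving computes
# it as of "now". Same code path, so no training-serving skew.
training_value = purchases_last_30_days(purchase_history,
                                        as_of=now - timedelta(days=35))
serving_value = purchases_last_30_days(purchase_history, as_of=now)
```

The skew scenario is the opposite of this: training uses one windowing or timezone convention and a reimplemented serving path uses another, and the model silently receives inputs it was never trained on.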
Data pipelines: why "it'll take 2 weeks" isn't stalling
When a DS team says they need 2–4 weeks before they can start analysis or training, they're often waiting on or building a data pipeline. A pipeline is automated code that transforms raw data into analysis-ready tables on a schedule.
Building a pipeline involves: writing transformation logic, handling edge cases and bad data, setting up scheduling and alerting, testing for correctness, and ensuring it doesn't break when upstream schemas change. For a clean, well-documented data environment this can take days. For a messy one with inconsistent logging and schema debt, it takes weeks.
PM Insight
The single best investment a PM can make in their data team's velocity is clean, consistent instrumentation from day one. Every event logged without a schema review, every field named inconsistently, and every platform that doesn't log the same events becomes debt that someone pays in weeks of pipeline work before every analysis.
What PMs need to be able to do (not just know)
You don't need to write SQL or build pipelines. But the following will make you dramatically more effective working with data and DS teams:
Read a SQL query
Not write — read. Understand what SELECT, FROM, WHERE, GROUP BY, and JOIN do. When your DS team shares an analysis, being able to read the query lets you spot whether the definition of "active user" or "converted" matches what you meant. Many analytical errors live in the query definition, not the interpretation.
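As a concrete example of the kind of query worth reading closely, here is the cohort question from earlier ("signed up in January, still active in April, by channel") run against a toy SQLite database. The table and column names are invented; the thing to read is the WHERE clause, because that is where "active" gets defined:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id TEXT, signup_month TEXT, channel TEXT);
    CREATE TABLE events (user_id TEXT, event_month TEXT, event_name TEXT);
    INSERT INTO users  VALUES ('u1','2026-01','paid'), ('u2','2026-01','organic'),
                              ('u3','2026-01','paid'), ('u4','2026-02','paid');
    INSERT INTO events VALUES ('u1','2026-04','checkout'),
                              ('u2','2026-04','page_view'),
                              ('u3','2026-03','checkout');
""")

# "Still active in April" is DEFINED by the WHERE clause: here it means
# any event in 2026-04. If you meant "made a purchase in April", then
# u2 (who only viewed a page) should not be counted as retained.
query = """
    SELECT u.channel, COUNT(DISTINCT u.user_id) AS retained_users
    FROM users u
    JOIN events e ON e.user_id = u.user_id      -- tie behavior to signups
    WHERE u.signup_month = '2026-01'            -- the January cohort
      AND e.event_month  = '2026-04'            -- "active" = any April event
    GROUP BY u.channel
"""
rows = dict(conn.execute(query).fetchall())
```

Reading that query, a PM can catch the mismatch ("any event" vs "a purchase") in thirty seconds; left unread, it ships as a retention number nobody questions.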
Know what's instrumented
Maintain or have access to an event catalog — a list of what events exist, what properties they carry, and when they were introduced. Before starting any analysis project, check it. "Do we log that?" is a question that should take 2 minutes, not 2 days.
Spec instrumentation in feature PRDs
Every feature PRD should include an instrumentation section: what events need to fire, on which user actions, with which properties. Write this before engineering starts, not after. Once a feature ships without instrumentation, you've lost all historical baseline data for it.
Ask "what's the data freshness?" before trusting a dashboard
Batch pipelines often have 6–24 hour latency. If you're looking at a dashboard during an incident and asking "why is this metric down?", check when the data was last updated. Real-time issues require streaming dashboards, not batch ones.
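A sketch of the kind of freshness guard a dashboard could surface, assuming the warehouse exposes a last-loaded timestamp for each table (the function name and six-hour threshold are illustrative):

```python
from datetime import datetime, timedelta, timezone

def freshness_warning(last_loaded_at, max_age=timedelta(hours=6)):
    """Return a warning string if the table is staler than the batch SLA,
    else None. `last_loaded_at` would come from the warehouse's load metadata."""
    age = datetime.now(timezone.utc) - last_loaded_at
    if age > max_age:
        hours = age.total_seconds() / 3600
        return f"Data is {hours:.0f}h old -- do not debug a live incident with it"
    return None

# A dashboard fed by a nightly batch job can easily be ~20 hours stale.
stale = freshness_warning(datetime.now(timezone.utc) - timedelta(hours=20))
```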
PM Playbook — Questions to ask
- Are we logging [X]? — before assuming you have data, check the event catalog
- What properties does that event carry? — event existence doesn't mean it has the fields you need
- When was that event introduced? — events added after a feature launched have no historical data
- What's the pipeline latency? — how old is the data you're looking at?
- How is [metric X] defined in the pipeline? — read the query definition, not just the dashboard label
- Do we have labels for what we're trying to predict? — for any ML project, this is the first feasibility question
- Is the same feature computed the same way in training and serving? — training-serving skew check for any production model
- What does the data quality look like for this dataset? — ask for null rates, duplicate rates, and coverage before building on it
- What events do we need to add to the PRD? — instrumentation is a feature requirement, not an afterthought