Chapter 14

Data Infrastructure for PMs

How user actions become data — and why "we have that data" is almost always the wrong assumption to start from.


The question that derails ML projects

A PM proposes a churn prediction model. The data scientist asks: "Do we have historical labels — users who churned, with their behavioral data from the 90 days before they left?" The PM says "yes, we have everything in our database." Three weeks later, the DS team reports that the data doesn't exist in a usable form: it's siloed across three systems, inconsistently logged, and missing the most predictive signals entirely.

This is the most common cause of ML project delays — and it's entirely avoidable if PMs understand how data actually gets from a user's action to a model's training set.

PM Insight

"We have that data" should always be followed by: "in what form? how complete? with what latency? queryable by whom? and does it have the labels we need?" Each of those questions can change the answer from "yes" to "no, but we could build it in 6 weeks."


How data flows: from click to query

A user clicks a button in your app. Here's what has to happen before that click appears in an analyst's query or a model's training data:

1. Instrumentation — the click fires an event

Your frontend or backend code emits a structured event. This is tracking code that someone explicitly wrote. If nobody instrumented the click, no data is collected — regardless of what your database contains. This is the most common source of "we don't have that data."

2. Event queue — the event is sent to a stream

The event goes to a message queue (Kafka, Kinesis, Pub/Sub). This is a high-throughput buffer that decouples the client from the storage system. Events can arrive out of order or with delays — which affects timestamp accuracy in your data.

3. Ingestion — data lands in a raw store

Events are written to a raw data lake (S3, GCS, Azure Blob). This is the immutable record — everything arrives here, often as JSON blobs. It's comprehensive but not queryable in any useful way yet.

4. Transformation — ETL/ELT pipelines structure the data

Pipelines (dbt, Airflow, Spark) read the raw events and transform them into structured tables: user activity tables, session tables, cohort tables. This is where business logic lives — how "active" is defined, how sessions are counted, how attribution works. These definitions can be wrong.

5. Data warehouse — analysts query structured tables

Transformed data lands in a warehouse (BigQuery, Snowflake, Redshift, Databricks). This is what your BI tools and analysts query. The latency from step 1 to queryable data is typically 1–24 hours for batch pipelines, near-real-time for streaming architectures.

6. Feature store — ML teams compute model inputs

For ML specifically, a feature store (Feast, Tecton, Vertex AI Feature Store) precomputes and stores the exact inputs a model needs. This ensures training and serving use identical features — avoiding one of the most dangerous ML failure modes (training-serving skew).
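
The flow above can be sketched end to end in a few lines. This is a toy illustration, not a real pipeline: the event shapes, the table layout, and the definition of "active" are all assumptions invented for the example — and that last one is exactly the kind of business logic (step 4) a PM should be reviewing.

```python
import json
from collections import defaultdict

# Hypothetical raw events as they might land in the data lake (step 3):
# opaque JSON blobs, one per line, comprehensive but not yet queryable.
raw_events = [
    '{"event_name": "cart_item_added", "user_id": "usr_1", "timestamp": "2026-04-23T14:32:11Z"}',
    '{"event_name": "page_view", "user_id": "usr_1", "timestamp": "2026-04-23T14:30:02Z"}',
    '{"event_name": "page_view", "user_id": "usr_2", "timestamp": "2026-04-23T09:01:55Z"}',
]

def build_user_activity(raw: list[str]) -> dict[str, dict]:
    """Toy transformation (step 4): raw JSON blobs -> per-user activity rows.

    The business logic lives here. This sketch defines "active" as
    "fired at least one event" -- a definition that can be wrong or
    disputed, which is exactly why these pipelines need PM review.
    """
    activity: dict[str, dict] = defaultdict(lambda: {"event_count": 0, "is_active": False})
    for blob in raw:
        event = json.loads(blob)
        row = activity[event["user_id"]]
        row["event_count"] += 1
        row["is_active"] = True  # the (debatable) definition of "active"
    return dict(activity)

# The resulting table is what lands in the warehouse (step 5).
table = build_user_activity(raw_events)
print(table["usr_1"]["event_count"])  # 2
```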


What an event actually looks like

Events are the atomic unit of behavioral data. Every click, view, submission, or error your app generates can be logged as an event. A well-structured event looks something like this:

// User adds an item to cart — event fired by frontend
{
  "event_name": "cart_item_added",
  "timestamp": "2026-04-23T14:32:11.042Z",
  "user_id": "usr_8f2a91c",
  "session_id": "ses_7b3d44e",
  "platform": "ios",
  "properties": {
    "item_id": "prod_1842",
    "item_category": "electronics",
    "price_usd": 89.99,
    "quantity": 1,
    "source_page": "search_results",
    "experiment_variant": "B"  // experiment context
  }
}

Notice what's in there: who, when, where (platform, page), what (item details), and experiment context. If any of these aren't logged at event time, you can't reconstruct them later. This is why instrumentation decisions made at build time constrain analysis options for years.

PM Insight

Whenever you're planning a new feature, ask your data team: "What events do we need to log, and what properties do we need on each?" Do this in the design phase, not after shipping. Adding instrumentation post-launch means you lose all historical data for that behavior.
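
One way to make that design-phase conversation concrete is a shared event spec that can be checked mechanically before launch. Everything below — the spec format, the helper name, the property lists — is a hypothetical sketch of the idea, not a standard tool:

```python
# Hypothetical event spec a PM might attach to a feature PRD: the
# required properties agreed on in the design phase, checked in code
# review or CI before the feature ships.
EVENT_SPECS = {
    "cart_item_added": {"item_id", "item_category", "price_usd", "quantity", "source_page"},
}

def missing_properties(event_name: str, properties: dict) -> set[str]:
    """Return the required properties absent from a logged event."""
    required = EVENT_SPECS.get(event_name, set())
    return required - properties.keys()

# These are exactly the fields you could never reconstruct post-launch.
gaps = missing_properties("cart_item_added", {"item_id": "prod_1842", "quantity": 1})
print(sorted(gaps))
```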


The data warehouse: what it is and why it matters to you

A data warehouse is a database optimized for analytical queries — reading and aggregating large volumes of historical data — rather than for transactional operations (reading/writing single records fast).

Your production database (Postgres, MySQL, DynamoDB) is built for the app: fast individual reads and writes, single records at a time. It would fall over if your entire analytics team ran queries on it. The warehouse is a separate, read-optimized copy of your data, built for questions like "how many users who signed up in January were still active in April, broken down by acquisition channel?"

Production DB (app)

  • Optimized for single-record reads/writes
  • Millisecond response times
  • Limited history (performance)
  • Powers live user experience
  • Not for analytics queries

Data warehouse (analytics)

  • Optimized for large scans and aggregations
  • Seconds to minutes for big queries
  • Full history, years of data
  • Powers BI, dashboards, ML training
  • Not for live app queries

Common warehouses you'll hear about: BigQuery (Google), Snowflake, Redshift (AWS), Databricks. They're mostly interchangeable from a PM's perspective — what matters is understanding that they exist separately from your production system and have their own latency, freshness, and access constraints.


Feature stores: the ML-specific problem

For ML models, there's an additional layer most PMs don't know exists: the feature store. Understanding it will help you ask much better questions about ML project timelines.

A feature store solves a specific problem: when you train a model, you compute features from historical data (e.g. "number of purchases in the last 30 days"). When you serve the model in production, you need to compute the same features in real-time for each prediction. If those computations are done differently — different code paths, different definitions, different windows — the model sees input distributions in production that look nothing like training. It fails silently.

Training-serving skew — the silent failure mode

The model is trained on features computed one way. In production, the same features are computed slightly differently (different timezone handling, a join that excludes some records, a field that means something subtly different). The model was never tested on what it actually receives. Accuracy degrades, nobody knows why. This is why feature stores exist: one computation, used everywhere.

When your DS team says "we need to build out the feature store before we can ship this model," this is what they mean. It's not stalling — it's the engineering foundation that makes the model reliable.
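
In miniature, the fix looks like this: a single feature definition that both training and serving call, so there is no second code path to drift. The function name, the 30-day window, and the data below are invented for illustration:

```python
from datetime import datetime, timedelta, timezone

def purchases_last_30_days(purchase_times: list[datetime], as_of: datetime) -> int:
    """One feature definition, used by BOTH training and serving.

    The feature store's core idea in miniature: if training and serving
    each wrote their own version of this window logic (different timezone
    handling, an off-by-one at the boundary), the model would see a
    different input distribution in production -- training-serving skew.
    """
    cutoff = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if cutoff <= t <= as_of)

now = datetime(2026, 4, 23, tzinfo=timezone.utc)
history = [now - timedelta(days=d) for d in (2, 15, 45)]  # 45 falls outside the window

# Training computes the feature from historical data...
training_value = purchases_last_30_days(history, as_of=now)
# ...and serving calls the exact same function at prediction time.
serving_value = purchases_last_30_days(history, as_of=now)
print(training_value, serving_value)  # 2 2
```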


Why "we have that data" is the wrong starting point

Data Readiness Check

When your DS team raises a data blocker, it usually arrives as one of a handful of phrases. Each maps to a specific stage of the pipeline above:

  • "We don't have the events logged for that." — an instrumentation gap (step 1); the data was never collected, no matter what the database holds.
  • "The data quality is too poor to train on." — the raw events exist but suffer from the problems listed below.
  • "We don't have labels for this." — behavioral data exists, but not the historical outcomes a supervised model learns from.
  • "The data isn't fresh enough." — the batch pipeline's latency is too high for the use case; a streaming path may be needed.
  • "We have a training-serving skew problem." — features are computed differently in training and production.

Data quality: the six problems that derail analysis

Missing data
Events are fired inconsistently — some platforms log them, others don't. Or a field exists in the schema but is null for 40% of records.
Impact: biased analysis; model trained on non-representative users

Schema drift
An event's structure changed at some point — a field was renamed, removed, or its meaning changed. Old and new data look similar but mean different things.
Impact: silent errors in dashboards; model features computed incorrectly on historical data

Duplicate events
Network retries, client-side bugs, or pipeline failures cause the same event to be logged multiple times. A session might appear to have twice the actions it actually had.
Impact: inflated engagement metrics; model learns from fake signal

Clock skew / timestamp issues
Client-side events use device clocks, which are often wrong. Events arrive out of order. A user's action appears to happen after the event that triggered it.
Impact: broken funnel analysis; impossible sequences confuse models

Identity resolution failures
The same user has different IDs across sessions, platforms, or login states (logged in vs anonymous). Events can't be stitched into a coherent user journey.
Impact: user counts are wrong; model can't see the full picture of a user's behavior

Survivorship bias
Your historical data only contains users who reached a certain state (e.g. you only have data on users who signed up, not on people who bounced before signup).
Impact: model learns from winners only; fails on the population it needs to serve
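
Of these, duplicate events have the most mechanical fix — provided events carry a client-generated ID, which is an assumption this sketch makes (without one, deduplication falls back on fuzzy matching of timestamps and payloads):

```python
# Hypothetical dedup pass: a network retry logged the same event twice.
events = [
    {"event_id": "evt_a1", "event_name": "cart_item_added", "user_id": "usr_1"},
    {"event_id": "evt_a1", "event_name": "cart_item_added", "user_id": "usr_1"},  # retry duplicate
    {"event_id": "evt_b2", "event_name": "checkout_started", "user_id": "usr_1"},
]

def deduplicate(raw: list[dict]) -> list[dict]:
    """Keep the first occurrence of each event_id, drop exact replays."""
    seen: set[str] = set()
    unique = []
    for event in raw:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            unique.append(event)
    return unique

print(len(events), "->", len(deduplicate(events)))  # 3 -> 2
```

Without the dedup pass, this session appears to have three actions instead of two — the inflated-engagement failure described above.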

Data pipelines: why "it'll take 2 weeks" isn't stalling

When a DS team says they need 2–4 weeks before they can start analysis or training, they're often waiting on or building a data pipeline. A pipeline is automated code that transforms raw data into analysis-ready tables on a schedule.

Building a pipeline involves: writing transformation logic, handling edge cases and bad data, setting up scheduling and alerting, testing for correctness, and ensuring it doesn't break when upstream schemas change. For a clean, well-documented data environment this can take days. For a messy one with inconsistent logging and schema debt, it takes weeks.
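
A sketch of what "handling edge cases and bad data" means in practice. Most of this toy step is defensive code, and that is where the weeks go — real pipelines additionally handle retries, backfills, and upstream schema changes:

```python
import json

def run_pipeline_step(raw_lines: list[str]) -> tuple[list[dict], int]:
    """One transformation step with the unglamorous work included:
    skip malformed records instead of crashing, and count them so an
    alert can fire if the bad-record rate spikes."""
    rows, bad = [], 0
    for line in raw_lines:
        try:
            event = json.loads(line)
            rows.append({"user_id": event["user_id"], "event_name": event["event_name"]})
        except (json.JSONDecodeError, KeyError):
            bad += 1  # feed this count into monitoring/alerting
    return rows, bad

rows, bad = run_pipeline_step([
    '{"user_id": "usr_1", "event_name": "page_view"}',
    '{not valid json',                # malformed record
    '{"user_id": "usr_2"}',          # missing event_name
])
print(len(rows), bad)  # 1 2
```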

PM Insight

The single best investment a PM can make in their data team's velocity is clean, consistent instrumentation from day one. Every event logged without a schema review, every field named inconsistently, and every platform that doesn't log the same events becomes debt that someone pays in weeks of pipeline work before every analysis.


What PMs need to be able to do (not just know)

You don't need to write SQL or build pipelines. But the following will make you dramatically more effective working with data and DS teams:

Read a SQL query

Not write — read. Understand what SELECT, FROM, WHERE, GROUP BY, and JOIN do. When your DS team shares an analysis, being able to read the query lets you spot whether the definition of "active user" or "converted" matches what you meant. Many analytical errors live in the query definition, not the interpretation.
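
A minimal worked example, using SQLite so it runs anywhere (the table and column names are invented). The query counts "active users" — and the definition hides entirely in the WHERE clause, which is exactly what reading the query lets you catch:

```python
import sqlite3

# Toy warehouse table so the query below actually runs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_name TEXT, event_date TEXT);
    INSERT INTO events VALUES
        ('usr_1', 'page_view', '2026-04-01'),
        ('usr_1', 'purchase',  '2026-04-02'),
        ('usr_2', 'page_view', '2026-04-03'),
        ('usr_3', 'purchase',  '2026-03-15');
""")

# Here "active" means "any event in April". If the PM meant
# "purchased in April", this count is wrong -- the mismatch lives
# in the query definition, not the interpretation.
query = """
    SELECT COUNT(DISTINCT user_id)        -- how many unique users...
    FROM events                           -- ...in the events table...
    WHERE event_date >= '2026-04-01'      -- ...with any activity since April 1
"""
active_users = conn.execute(query).fetchone()[0]
print(active_users)  # 2
```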

Know what's instrumented

Maintain or have access to an event catalog — a list of what events exist, what properties they carry, and when they were introduced. Before starting any analysis project, check it. "Do we log that?" is a question that should take 2 minutes, not 2 days.
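
An event catalog doesn't require tooling — even a shared structured document works. A hypothetical sketch of the idea, with invented events and fields:

```python
# A minimal event catalog: what exists, what it carries, since when,
# and on which platforms. Gaps are visible at a glance.
EVENT_CATALOG = {
    "cart_item_added": {
        "properties": ["item_id", "item_category", "price_usd", "quantity"],
        "introduced": "2024-06-12",
        "platforms": ["ios", "android", "web"],
    },
    "checkout_started": {
        "properties": ["cart_value_usd"],
        "introduced": "2025-01-30",
        "platforms": ["ios", "web"],  # android gap: analyses will undercount
    },
}

def do_we_log(event_name: str) -> str:
    """Answer 'do we log that?' in seconds instead of days."""
    entry = EVENT_CATALOG.get(event_name)
    if entry is None:
        return f"No -- '{event_name}' is not instrumented."
    return f"Yes, since {entry['introduced']} on {', '.join(entry['platforms'])}."

print(do_we_log("checkout_started"))
print(do_we_log("refund_requested"))
```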

Spec instrumentation in feature PRDs

Every feature PRD should include an instrumentation section: what events need to fire, on which user actions, with which properties. Write this before engineering starts, not after. Once a feature ships without instrumentation, you've lost all historical baseline data for it.

Ask "what's the data freshness?" before trusting a dashboard

Batch pipelines often have 6–24 hour latency. If you're looking at a dashboard during an incident and asking "why is this metric down?", check when the data was last updated. Real-time issues require streaming dashboards, not batch ones.

