Chapter 30 of 75

Data Strategy for Product Managers

AI products are data products. A PM who cannot think clearly about data quality, data pipelines, and data governance cannot effectively manage an AI feature — they will be surprised by problems that were predictable from the data.

Part III — The Practitioner's Toolkit

Product managers working on AI features quickly discover that data is not an engineering implementation detail; it is a product constraint. The availability, quality, and structure of data determine what the AI can do, how well it can do it, and how quickly it can improve. PMs who treat data strategy as someone else's problem ship AI features that underperform for reasons that were entirely predictable.

What You Will Learn

The four data questions every PM must answer before committing to AI feature timelines
How to assess data quality without being a data engineer
The data pipeline dependencies that determine actual AI feature timelines
How to build data flywheels that improve AI products over time

30.1 Four Data Questions Before Committing to Timelines

Before scoping any AI feature, answer these four questions. Honest answers will save weeks of rework.

Question 1: Does the data exist? Is the data the AI needs to perform its task available in the organization's systems? Organizations routinely discover that data they assumed was available is siloed in systems with poor APIs, stored in formats that require significant processing, or simply not captured at all.

Question 2: Is it accessible? Even if the data exists, can the engineering team access it? Data access is frequently blocked by organizational politics (the owning team has not prioritized the integration), technical limitations (the system does not expose an API), or governance processes (a data use agreement must be negotiated before access is granted). Each of these can add weeks or months to a timeline.

Question 3: Is it in the right format? Raw data and training-ready data are not the same thing. Unstructured text must be parsed, cleaned, and labeled. Numerical data may need normalization, outlier handling, and feature engineering. Temporal data needs consistency checks. Data transformation is its own engineering work and its own source of timeline risk.

Question 4: Is it representative? Does the available data reflect the distribution of inputs the AI will encounter in production? A model trained on data from 2019–2022 may not generalize to 2025 inputs if the underlying patterns have shifted. A model trained on data from headquarters may not generalize to regional offices with different workflows.
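One rough way to start on Question 4 is to compare how often each category appears in the training data versus a recent production sample. The sketch below is a minimal illustration under assumed inputs: the function names and the office labels are hypothetical, and a real check would cover more fields and use a proper statistical test.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def largest_share_gaps(training_values, production_values, top_n=3):
    """Rank categories by the absolute difference in share between samples."""
    train = category_shares(training_values)
    prod = category_shares(production_values)
    categories = set(train) | set(prod)
    gaps = {c: abs(train.get(c, 0.0) - prod.get(c, 0.0)) for c in categories}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical data: the office each record came from.
training_offices = ["hq"] * 90 + ["regional"] * 10
production_offices = ["hq"] * 50 + ["regional"] * 50

for category, gap in largest_share_gaps(training_offices, production_offices):
    print(f"{category}: share differs by {gap:.0%}")
```

A large gap on a field the model relies on is a prompt to ask the data team for more representative training data, not proof that the model will fail.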

PMs who can answer these four questions before committing to timelines avoid the most common AI project failure mode: discovering mid-development that the data situation is more complex than assumed.

30.2 Assessing Data Quality Without Being a Data Engineer

PMs do not need to write SQL to assess data quality. They need to ask the right questions and interpret the answers.

Completeness. What percentage of records have values in each field? A customer database where 40% of records are missing the industry code is a problem for any AI that uses industry as a feature. Ask for null rates on key fields.
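The null-rate request can be made concrete with a few lines of code. This is a minimal sketch, assuming records arrive as plain dictionaries; the `null_rates` helper and the sample customer records are hypothetical, and a data team would more likely run the equivalent query directly against the warehouse.

```python
def null_rates(records, fields):
    """Return the fraction of records where each field is missing or empty."""
    total = len(records)
    rates = {}
    for field in fields:
        # Treat an absent key, None, and an empty string as "missing".
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


# Hypothetical customer records with gaps in the industry code.
customers = [
    {"name": "Acme", "industry_code": "MFG"},
    {"name": "Globex", "industry_code": None},
    {"name": "Initech", "industry_code": ""},
    {"name": "Umbrella"},
    {"name": "Hooli", "industry_code": "TECH"},
]

for field, rate in null_rates(customers, ["name", "industry_code"]).items():
    print(f"{field}: {rate:.0%} missing")
```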

Consistency. Is the same concept represented consistently across the dataset? "Customer" may mean different things in different system-of-record databases — a billing customer, a CRM contact, a portal user. Inconsistent entity definitions produce confusing training data.

Freshness. How recent is the data? How frequently is it updated? For AI features that depend on current state (inventory levels, customer sentiment, market prices), stale data is a correctness problem, not just a performance problem.
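A freshness check can be as simple as comparing each source's last update time against the maximum age the feature can tolerate. The sketch below assumes someone can supply a last-updated timestamp per source; the source names, the one-day threshold, and the `stale_sources` helper are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_updated, max_age, now=None):
    """Return the age of every source whose last update is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return {
        name: now - updated_at
        for name, updated_at in last_updated.items()
        if now - updated_at > max_age
    }


# Hypothetical timestamps, with a fixed "now" so the example is repeatable.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
last_updated = {
    "inventory_levels": now - timedelta(hours=2),
    "crm_contacts": now - timedelta(days=3),
}

for name, age in stale_sources(last_updated, timedelta(days=1), now=now).items():
    print(f"{name} is stale: last updated {age.days} days ago")
```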

Label quality. For supervised learning tasks, the quality of the labels determines the ceiling on model performance. Who labeled the training data? What were their instructions? How much disagreement was there between labelers? Labels produced by inconsistent or poorly instructed labelers cannot train a high-quality model, regardless of volume.
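Disagreement between labelers can be quantified. One common measure for two labelers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration with made-up labels; it is not a claim about how any particular labeling tool reports agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labelers must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each labeler's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both labelers used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical labels: the labelers agree on 8 of 10 items.
labeler_1 = ["yes"] * 5 + ["no"] * 5
labeler_2 = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]

print(f"kappa = {cohens_kappa(labeler_1, labeler_2):.2f}")
```

Raw agreement here is 80%, but because half of that would be expected by chance, kappa is lower. A PM does not need to compute this personally; asking whether the number exists is often enough to surface a labeling problem.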

Coverage of rare events. For AI features that handle edge cases (fraud detection, error classification, anomaly detection), the rare events are often underrepresented in historical data. A fraud dataset where 99.9% of transactions are legitimate usually needs deliberate handling of the imbalance, such as oversampling fraud cases or weighting them more heavily during training, to produce a useful model.
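As an illustration of oversampling, the sketch below duplicates rare-class records at random until they make up a target share of the dataset. The function name, field names, and the 20% target are assumptions for the example; production systems often use library implementations or class weighting instead.

```python
import math
import random


def oversample_minority(records, label_key, minority_label, target_share, seed=0):
    """Randomly duplicate minority records until they reach at least target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    minority = [r for r in records if r[label_key] == minority_label]
    majority = [r for r in records if r[label_key] != minority_label]
    if not minority:
        raise ValueError("no minority examples to sample from")
    # Solve m / (m + len(majority)) >= target_share for the minority count m.
    needed = math.ceil(target_share * len(majority) / (1 - target_share))
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(max(0, needed - len(minority)))]
    resampled = majority + minority + extra
    rng.shuffle(resampled)
    return resampled


# Hypothetical transactions: 1% fraud before resampling.
transactions = [{"id": i, "label": "legit"} for i in range(990)]
transactions += [{"id": 990 + i, "label": "fraud"} for i in range(10)]

balanced = oversample_minority(transactions, "label", "fraud", target_share=0.2)
fraud_share = sum(r["label"] == "fraud" for r in balanced) / len(balanced)
print(f"fraud share after oversampling: {fraud_share:.1%}")
```

Oversampling should be applied to the training split only. Evaluation data should keep the real-world class balance, otherwise the measured performance will not reflect production.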

30.3 Data Pipeline Dependencies

AI features have more upstream dependencies than traditional software features. Understanding these dependencies is essential for realistic planning.

Ingestion pipelines. Data must move from its source systems to where it can be used for training and inference. If the ingestion pipeline does not exist, building it is a prerequisite — and it may require coordination with teams that own the source systems.

Labeling pipelines. For supervised learning, labeled data must be produced by a labeling process. Labeling workflows — whether done by internal SMEs, external labelers, or LLM-assisted labeling — require tooling, instructions, quality control, and time.

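Feature engineering pipelines. The raw data must be transformed into the features the model uses. These transformations must run in both the training environment and the production environment, and keeping them consistent is a significant engineering responsibility: if the two drift apart, the model sees different inputs in production than it was trained on, and quality degrades without any obvious error.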
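One common way to keep training and production transformations consistent is to route both through a single shared function. The sketch below illustrates the idea; the feature names and the `build_features` helper are hypothetical and not drawn from any particular feature store.

```python
import math


def build_features(raw):
    """Single source of truth for turning one raw record into model features."""
    order_value = raw.get("order_value") or 0.0
    return {
        # log1p compresses large order values; negatives are clamped to zero.
        "log_order_value": math.log1p(max(order_value, 0.0)),
        "is_weekend": 1 if raw.get("day_of_week") in ("sat", "sun") else 0,
        "industry_code": (raw.get("industry_code") or "unknown").lower(),
    }


def build_training_rows(historical_records):
    """Training path: apply the shared transform to every historical record."""
    return [build_features(record) for record in historical_records]


def build_serving_row(live_request):
    """Production path: apply the same shared transform to one live request."""
    return build_features(live_request)


record = {"order_value": 120.0, "day_of_week": "sat", "industry_code": "MFG"}
# The same raw record yields identical features on both paths.
assert build_training_rows([record])[0] == build_serving_row(record)
```

The design choice is that neither path contains transformation logic of its own, so a change to a feature definition cannot reach training without also reaching production.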

Feedback loops. AI features improve when the product captures signals about output quality — thumbs-up/thumbs-down, correction actions, downstream outcomes. Building these feedback loops into the product from the start, rather than retrofitting them later, is one of the highest-leverage data strategy decisions a PM can make.

30.4 Building Data Flywheels

The best AI products improve over time because they capture data from their own usage that makes the model better. This is the data flywheel: usage generates data, data improves the model, a better model drives more usage.

Building a data flywheel requires deliberate design:

Define the feedback signal. What user action signals that the AI output was good or bad? An explicit rating (thumbs up/down) is clean but low-volume. An implicit signal (the user accepted the AI draft without editing) is noisier but higher-volume. Design the product to capture the best available signal.

Store the input-output-feedback triple. For every AI inference, store the input, the output, and the feedback signal. This is the raw material for model improvement. Many teams store the output but not the input, which is a costly oversight: without the input, the example cannot be used to retrain or re-evaluate the model.
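A minimal sketch of storing the triple is shown below, using an in-memory SQLite database from the Python standard library. The table layout, column names, and helper functions are assumptions for illustration; a production system would add retention rules and the privacy controls appropriate to its data.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the inference log with one row per AI inference."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS inferences (
               inference_id TEXT PRIMARY KEY,
               created_at   TEXT NOT NULL,
               model_input  TEXT NOT NULL,
               model_output TEXT NOT NULL,
               feedback     TEXT
           )"""
    )
    return conn


def log_inference(conn, model_input, model_output):
    """Store the input and output at inference time; feedback arrives later."""
    inference_id = str(uuid.uuid4())
    created_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO inferences VALUES (?, ?, ?, ?, NULL)",
        (inference_id, created_at, json.dumps(model_input), json.dumps(model_output)),
    )
    conn.commit()
    return inference_id


def record_feedback(conn, inference_id, feedback):
    """Attach a feedback signal to an earlier inference; True if a row matched."""
    cursor = conn.execute(
        "UPDATE inferences SET feedback = ? WHERE inference_id = ?",
        (feedback, inference_id),
    )
    conn.commit()
    return cursor.rowcount == 1


def load_triples(conn):
    """Return (input, output, feedback) for every inference that has feedback."""
    rows = conn.execute(
        "SELECT model_input, model_output, feedback FROM inferences "
        "WHERE feedback IS NOT NULL"
    ).fetchall()
    return [(json.loads(i), json.loads(o), f) for i, o, f in rows]


conn = open_store()
draft_id = log_inference(conn, {"prompt": "Summarize ticket"}, {"text": "Draft summary"})
record_feedback(conn, draft_id, "accepted_without_edit")
print(load_triples(conn))
```

Issuing an identifier at inference time is what lets feedback that arrives minutes or days later be joined back to the exact input and output that produced it.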

Close the retraining loop. Captured feedback is only valuable if it is used to retrain or evaluate the model. Plan for regular retraining cycles — quarterly for stable tasks, more frequently for tasks where the data distribution evolves.

Protect user privacy. Feedback data may contain personal information. Ensure that data collection, storage, and use are consistent with the product's privacy policy and applicable regulations. Build privacy controls into the feedback pipeline from the start.

Data flywheels are what separate AI products that improve from AI products that stagnate. They require intentional design, not just good initial data.
