Chapter 35 of 75
Data Architecture for AI
AI capability is bounded by data architecture. The organizations that win with AI are not the ones with the most advanced models; they are the ones whose data is accessible, governable, and connected. Data architecture for AI is a strategic investment, not merely a technical prerequisite.
Part IV — Enterprise AI Architecture
Every AI capability gap that organizations experience can be traced to a data architecture gap. The best-funded AI initiative will underperform if the data it needs is siloed, ungoverned, poorly documented, or inaccessible to AI systems. Data architecture for AI is not a separate initiative from AI strategy — it is the foundation on which AI strategy executes.
What You Will Learn
- The data architecture patterns that enable versus impede AI capability
- How data mesh and data lakehouse architectures apply to AI use cases
- Feature stores and their role in ML infrastructure
- The governance structures that make enterprise AI data trustworthy
35.1 Data Architecture Patterns for AI
The data warehouse. A centralized, structured store of cleaned and integrated data from multiple source systems. Data warehouses (Snowflake, BigQuery, Redshift) are optimized for SQL analytics and reporting. They are suitable as the source of structured training data and as the target for AI-generated structured outputs. They are not well-suited to storing unstructured content (documents, text, images) or to serving low-latency inference requests.
The data lake. A centralized store of raw, unprocessed data in its native format. Data lakes can hold structured, semi-structured, and unstructured data at scale. They are the appropriate home for the raw content that feeds document AI (contracts, emails, support tickets, product documentation). The challenge with data lakes is quality: without governance, they accumulate data that is inconsistent, poorly documented, and difficult to use.
The data lakehouse. A hybrid architecture that combines the scale and format flexibility of a data lake with the structure, governance, and query performance of a data warehouse. Delta Lake, Apache Iceberg, and Apache Hudi are the primary lakehouse table formats. The lakehouse is increasingly the default enterprise data architecture because it supports both analytics workloads and the unstructured data that AI requires.
Data mesh. An organizational and architectural pattern that distributes data ownership to the domain teams that produce the data, with a common infrastructure platform for data publication and discovery. Data mesh addresses the governance gap in centralized architectures — data is owned and maintained by the teams with the most context about its meaning and quality. For AI, data mesh is relevant because it assigns accountability for data quality to the teams that can actually ensure it.
35.2 What AI Requires from Data Architecture
AI systems make demands on data architecture that traditional analytics workloads do not:
Real-time data access. Batch-oriented data pipelines that update data warehouses nightly are sufficient for most analytics. AI features that need to reason about current state (customer status, inventory levels, recent transactions) require fresh data. Meeting that requirement may call for change data capture (CDC) pipelines that propagate source-system changes in near real time.
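As a rough illustration of what a CDC consumer does, the sketch below applies a stream of change events to an in-memory view of current state. The `ChangeEvent` shape and the `CurrentStateView` class are hypothetical; real CDC tools each define their own event envelope and delivery guarantees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical CDC event; not the format of any specific tool."""
    op: str               # "insert", "update", or "delete"
    key: str              # primary key of the changed row
    row: Optional[dict]   # new row contents, None for deletes
    seq: int              # increasing position in the source change log


class CurrentStateView:
    """Keeps the latest version of each row by applying events in order."""

    def __init__(self) -> None:
        self.rows = {}
        self.last_seq = {}

    def apply(self, event: ChangeEvent) -> bool:
        # Skip duplicates and stale replays for the same key.
        if event.seq <= self.last_seq.get(event.key, -1):
            return False
        self.last_seq[event.key] = event.seq
        if event.op == "delete":
            self.rows.pop(event.key, None)
        else:
            self.rows[event.key] = dict(event.row or {})
        return True


view = CurrentStateView()
events = [
    ChangeEvent("insert", "sku-1", {"on_hand": 10}, seq=1),
    ChangeEvent("update", "sku-1", {"on_hand": 7}, seq=2),
    ChangeEvent("update", "sku-1", {"on_hand": 10}, seq=1),  # stale replay
    ChangeEvent("insert", "sku-2", {"on_hand": 3}, seq=3),
    ChangeEvent("delete", "sku-2", None, seq=4),
]
applied = [view.apply(e) for e in events]
```

After the events are applied, `view.rows` holds only the current inventory for `sku-1`. The sequence check is one simple way to tolerate redelivered events; it is an assumption of this sketch, not a requirement of CDC in general.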
Unstructured data support. AI's most impactful enterprise applications operate on unstructured content: documents, emails, support tickets, call transcripts. Data architectures designed for structured data must be extended — or supplemented with document stores and vector databases — to support these use cases.
Data lineage. For AI outputs to be auditable, the data that informed them must be traceable. What document did this RAG response retrieve? What training data produced this model's prediction? Data lineage tracking — recording the provenance of data as it flows through pipelines — is the infrastructure that makes AI auditability possible.
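One way to answer "what document did this RAG response retrieve?" is to write a lineage record alongside every response. The sketch below is a minimal illustration using only the standard library; the record fields, the `record_lineage` helper, and the model name are hypothetical, not a standard lineage schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceRef:
    doc_id: str
    version: str
    content_sha256: str   # fingerprint of the exact text that was retrieved


@dataclass
class LineageRecord:
    response_id: str
    model: str
    sources: list
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_lineage(response_id, model, retrieved):
    """retrieved: iterable of (doc_id, version, text) tuples."""
    sources = [
        SourceRef(doc_id, version, hashlib.sha256(text.encode("utf-8")).hexdigest())
        for doc_id, version, text in retrieved
    ]
    return LineageRecord(response_id, model, sources)


record = record_lineage(
    "resp-001",
    "example-model-v1",  # hypothetical model identifier
    [
        ("contract-17", "v3", "Payment is due within 30 days."),
        ("policy-02", "v1", "Refunds require manager approval."),
    ],
)
# One JSON line per response, suitable for an append-only audit log.
audit_line = json.dumps(asdict(record), sort_keys=True)
```

Storing a content hash as well as the document ID lets an auditor confirm later that the document has not changed since the response was generated.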
Consistent entity resolution. AI systems that reason across multiple data sources need a consistent representation of core entities — customers, products, contracts, employees. Without entity resolution (mapping different identifiers and representations of the same entity to a canonical form), AI systems reason about the same entity as if it were multiple distinct entities.
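The simplest form of entity resolution is a crosswalk table that maps each source system's identifier to a canonical ID. The sketch below assumes such a table already exists and uses exact matching after light normalization; the `EntityResolver` class and the sample identifiers are hypothetical, and production systems often need fuzzy or probabilistic matching on top of this.

```python
import re
from typing import Optional


def normalize(source_id: str) -> str:
    """Lowercase and drop punctuation so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", source_id.lower())


class EntityResolver:
    def __init__(self) -> None:
        self._crosswalk = {}   # (source system, normalized id) -> canonical id

    def register(self, canonical_id: str, source: str, source_id: str) -> None:
        self._crosswalk[(source, normalize(source_id))] = canonical_id

    def resolve(self, source: str, source_id: str) -> Optional[str]:
        return self._crosswalk.get((source, normalize(source_id)))


resolver = EntityResolver()
resolver.register("cust-1", "crm", "ACME-001")
resolver.register("cust-1", "billing", "9F2")

records = [
    {"source": "crm", "id": "acme 001", "status": "active"},
    {"source": "billing", "id": "9f2", "balance": 120},
    {"source": "support", "id": "T-77", "open_tickets": 2},
]

merged, unresolved = {}, []
for rec in records:
    canonical = resolver.resolve(rec["source"], rec["id"])
    if canonical is None:
        unresolved.append(rec)   # surface gaps instead of guessing
        continue
    attrs = {k: v for k, v in rec.items() if k not in ("source", "id")}
    merged.setdefault(canonical, {}).update(attrs)
```

The CRM and billing records collapse into one view of `cust-1`, while the unmapped support record is set aside for review rather than treated as a new customer.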
Governed access. AI systems that retrieve and reason about sensitive data must enforce the same access controls as the source systems. An AI assistant that can retrieve HR records it should not have access to is a compliance and reputational risk, whether or not a human could have retrieved the same records by other means.
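In a retrieval pipeline, one place to enforce this is between retrieval and prompt assembly: candidate documents are filtered by the requesting user's entitlements before any text reaches the model. The sketch below assumes group-based permissions copied from the source system; the `Document` type and the group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset   # groups permitted to read this document


def authorized(docs, user_groups):
    """Keep only documents that share at least one group with the user."""
    groups = set(user_groups)
    return [d for d in docs if d.allowed_groups & groups]


def build_context(candidates, user_groups, limit=3):
    # Filter first, then truncate, so hidden documents never occupy a slot.
    visible = authorized(candidates, user_groups)
    return "\n".join(d.text for d in visible[:limit])


candidates = [
    Document("hr-42", "Salary band for role X is confidential.", frozenset({"hr"})),
    Document("wiki-7", "The office opens at 9am.", frozenset({"hr", "engineering"})),
]

engineer_context = build_context(candidates, {"engineering"})
hr_context = build_context(candidates, {"hr"})
```

An engineer's prompt contains only the wiki page, while an HR user's prompt contains both documents. Filtering after generation would be too late, because the model would already have seen the restricted text.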
35.3 Feature Stores
A feature store is infrastructure for managing the features — processed, derived signals — that machine learning models use. It has two components: an offline store (historical feature values, used for model training) and an online store (current feature values, used for real-time inference).
The problem feature stores solve: Without a feature store, feature engineering code is duplicated between the training pipeline and the inference pipeline. If the two copies diverge (different implementations, different data sources, different preprocessing logic), the model in production receives inputs that differ from what it was trained on. This mismatch, known as training-serving skew, degrades performance silently.
What a feature store provides:
- Feature definitions: a registry of named features, their computation logic, and their metadata (data source, update frequency, owner).
- Offline serving: historical feature values for model training, with point-in-time correctness (the feature values that were available at the time of the training label, not future values).
- Online serving: current feature values for real-time inference, with low-latency retrieval.
- Feature reuse: teams across the organization can discover and reuse features without re-implementing the computation logic.
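The difference between offline and online serving can be illustrated with a toy in-memory store. This is a sketch under simplifying assumptions, not how any particular feature store product is implemented: `TinyFeatureStore` is hypothetical and timestamps are plain integers.

```python
from bisect import bisect_right
from collections import defaultdict


class TinyFeatureStore:
    def __init__(self):
        self._times = defaultdict(list)    # (entity, feature) -> sorted timestamps
        self._values = defaultdict(list)   # values aligned with _times

    def write(self, entity, feature, ts, value):
        key = (entity, feature)
        i = bisect_right(self._times[key], ts)
        self._times[key].insert(i, ts)
        self._values[key].insert(i, value)

    def get_as_of(self, entity, feature, ts):
        """Offline path: latest value written at or before ts."""
        key = (entity, feature)
        i = bisect_right(self._times.get(key, []), ts)
        return self._values[key][i - 1] if i else None

    def get_online(self, entity, feature):
        """Online path: the most recent value."""
        values = self._values.get((entity, feature))
        return values[-1] if values else None


store = TinyFeatureStore()
store.write("cust-1", "orders_30d", ts=10, value=2)
store.write("cust-1", "orders_30d", ts=20, value=5)
store.write("cust-1", "orders_30d", ts=30, value=9)

# A training label observed at ts=25 must see the value from ts=20, not ts=30.
training_value = store.get_as_of("cust-1", "orders_30d", ts=25)
serving_value = store.get_online("cust-1", "orders_30d")
too_early = store.get_as_of("cust-1", "orders_30d", ts=5)
```

Here `training_value` is 5 and `serving_value` is 9. Using the latest value for the training row would leak information from after the label was observed, which is exactly what point-in-time correctness prevents.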
When you need a feature store: If you have multiple ML models that use the same features, if you have real-time AI features where training-serving skew is a risk, or if you need to govern and document the features your AI systems use for audit purposes, a feature store is worth the investment.
35.4 Data Governance for AI
AI amplifies data governance problems. A poorly governed database produces bad reports. A poorly governed training dataset produces a model that encodes the same data quality problems and reproduces them at scale.
Data catalog integration. Every dataset used for AI training or inference should be registered in the organization's data catalog — its source, its schema, its quality metrics, its update frequency, its owner, and its approved use cases. AI systems that operate on undocumented data cannot be audited.
Data quality gates. Automated checks that validate data quality before data reaches AI training or inference pipelines. Checks should include: schema validation (expected fields are present and typed correctly), completeness checks (null rates within acceptable bounds), consistency checks (referential integrity, value range constraints), and freshness checks (data is not stale beyond the acceptable window).
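A quality gate of this kind can be as simple as a function that returns a list of failures and blocks the pipeline when the list is non-empty. The sketch below covers schema, completeness, value-range, and freshness checks on a list of dictionaries; the function name, thresholds, and sample rows are hypothetical, and referential integrity checks are omitted for brevity.

```python
from datetime import datetime, timedelta, timezone


def run_quality_gate(rows, schema, ranges, newest_record_at, now,
                     max_null_rate=0.05, max_age=timedelta(days=1)):
    """Return failure messages; an empty list means the batch passes."""
    failures = []

    # Schema validation: expected fields are present and correctly typed.
    for i, row in enumerate(rows):
        for name, expected in schema.items():
            if name not in row:
                failures.append(f"schema: row {i} is missing '{name}'")
            elif row[name] is not None and not isinstance(row[name], expected):
                failures.append(f"schema: row {i} '{name}' is not {expected.__name__}")

    # Completeness: null rate per field stays within the allowed bound.
    for name in schema:
        nulls = sum(1 for row in rows if row.get(name) is None)
        null_rate = nulls / len(rows) if rows else 1.0
        if null_rate > max_null_rate:
            failures.append(f"completeness: '{name}' null rate {null_rate:.0%}")

    # Consistency: numeric values fall inside their declared range.
    for name, (low, high) in ranges.items():
        for i, row in enumerate(rows):
            value = row.get(name)
            if isinstance(value, (int, float)) and not (low <= value <= high):
                failures.append(f"consistency: row {i} '{name}'={value} out of range")

    # Freshness: the newest record is within the acceptable window.
    if now - newest_record_at > max_age:
        failures.append("freshness: newest record is older than the allowed window")

    return failures


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
schema = {"customer_id": str, "age": int}
ranges = {"age": (0, 120)}

good = [{"customer_id": "c1", "age": 34}, {"customer_id": "c2", "age": 51}]
bad = [{"customer_id": "c1", "age": 340}, {"customer_id": "c2", "age": None}]

good_failures = run_quality_gate(good, schema, ranges, now - timedelta(hours=2), now)
bad_failures = run_quality_gate(bad, schema, ranges, now - timedelta(days=3), now)
```

The good batch yields no failures. The bad batch fails the completeness, consistency, and freshness checks, and a pipeline would stop before that data reached training or inference.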
Sensitive data classification. AI training data and inference inputs may contain PII, PHI, financial data, or other sensitive categories. Data classification — labeling data with its sensitivity level — enables automated enforcement of handling requirements: encryption at rest, access logging, retention limits, prohibition on certain AI use cases.
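Once columns carry sensitivity labels, handling requirements can be looked up rather than decided case by case. The following sketch shows the idea; the four sensitivity levels and every value in the `HANDLING` table are hypothetical placeholders, since real requirements come from legal and compliance teams.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2   # e.g. financial data
    RESTRICTED = 3     # e.g. PII or PHI


@dataclass(frozen=True)
class Handling:
    encrypt_at_rest: bool
    log_access: bool
    retention_days: Optional[int]   # None means no retention limit
    allow_training: bool


# Hypothetical policy table for illustration only.
HANDLING = {
    Sensitivity.PUBLIC: Handling(False, False, None, True),
    Sensitivity.INTERNAL: Handling(True, False, None, True),
    Sensitivity.CONFIDENTIAL: Handling(True, True, 730, False),
    Sensitivity.RESTRICTED: Handling(True, True, 365, False),
}


def dataset_sensitivity(column_labels):
    """A dataset is as sensitive as its most sensitive column."""
    return max(column_labels.values(), default=Sensitivity.PUBLIC)


def handling_for(column_labels):
    return HANDLING[dataset_sensitivity(column_labels)]


orders = {"order_id": Sensitivity.INTERNAL, "email": Sensitivity.RESTRICTED}
requirements = handling_for(orders)
```

Because one column is labeled `RESTRICTED`, the whole dataset inherits the strictest handling, including the prohibition on training use in this illustrative table.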
AI-specific data use policies. Many datasets collected for one purpose are not appropriate for AI training without explicit consent or policy authorization. Data use policies specific to AI — which datasets can be used for training, which can be used for inference, which require special handling — must be documented and enforced before data reaches AI systems.
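Enforcement can be as simple as a check that pipelines call before reading a dataset, backed by the approved use cases recorded in the data catalog. The sketch below denies by default, so an unregistered dataset is never usable; the `CatalogEntry` fields and dataset names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    owner: str
    approved_uses: frozenset   # subset of {"training", "inference"}


class DataUsePolicy:
    def __init__(self, entries):
        self._entries = {entry.name: entry for entry in entries}

    def is_allowed(self, dataset: str, use: str) -> bool:
        entry = self._entries.get(dataset)
        # Deny by default: unregistered datasets have no approved uses.
        return entry is not None and use in entry.approved_uses

    def require(self, dataset: str, use: str) -> None:
        if not self.is_allowed(dataset, use):
            raise PermissionError(f"'{dataset}' is not approved for {use}")


policy = DataUsePolicy([
    CatalogEntry("product_docs", "docs-team", frozenset({"training", "inference"})),
    CatalogEntry("support_tickets", "support-team", frozenset({"inference"})),
])

policy.require("product_docs", "training")   # passes silently
try:
    policy.require("support_tickets", "training")
    blocked = False
except PermissionError:
    blocked = True
```

In this example, support tickets may be retrieved at inference time but cannot be used for training, and a training job that requests them fails before any data is read.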
The organizations that build data governance for AI as a prerequisite, not as an afterthought, avoid the regulatory and reputational risks that have damaged early enterprise AI adopters. Data governance is not a barrier to AI — it is what makes AI trustworthy enough to deploy at enterprise scale.