Chapter 37 of 75

MLOps and AI Governance

MLOps is the engineering discipline that makes AI systems operable at enterprise scale. Without it, models live in notebooks and die in staging. With it, AI capabilities become reliable, auditable, and improvable production assets.


Part IV — Enterprise AI Architecture


Machine learning models do not automatically become production systems. The gap between a data scientist's notebook and a reliable, monitored, governable AI capability in production is the MLOps gap — and it is why organizations that invest heavily in AI research often fail to deliver AI products. MLOps is the engineering discipline that closes this gap, and AI governance is the organizational layer that ensures the resulting systems are trustworthy.

37.1 The MLOps Maturity Model

Level 0 — Manual. Data scientists build models in notebooks, export them manually, and hand them to engineering teams for deployment. No automated pipelines, no systematic evaluation, no monitoring. Models are static — they are not retrained after deployment. This is the state of most organizations' first AI projects.

Level 1 — ML pipeline automation. Training pipelines are automated: new data triggers retraining, which runs through automated preprocessing, training, and evaluation steps. Models are deployed automatically when evaluation passes. Monitoring is in place for model performance degradation. This level enables regular model updates without manual intervention.

Level 2: CI/CD for ML. The full software engineering CI/CD discipline is applied to ML: changes to model code, feature engineering, and training configuration trigger automated test suites. Models are versioned like software. Deployment is gated by automated quality checks. Rollback is automated. This level holds ML development to the same release discipline as conventional software delivery.

Most enterprise AI capabilities can operate effectively at Level 1. Level 2 is appropriate for AI capabilities that are central to business operations and where reliability is critical. Level 0 is not appropriate for any production AI system.

37.2 Core MLOps Components

Model registry. A centralized catalog of trained model versions, their evaluation metrics, their training data versions, and their deployment status. The model registry is the single source of truth for which model is in production and how it got there. Every enterprise AI system should be backed by a model registry.
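As a rough illustration of what a registry tracks, the sketch below keeps versions, metrics, training data versions, and deployment status in memory. The class and method names (ModelRegistry, register, promote, production) are hypothetical and are not taken from any particular registry product; a real deployment would use a persistent service.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    name: str
    version: int
    metrics: Dict[str, float]
    training_data_version: str
    status: str = "registered"  # registered -> production -> archived


class ModelRegistry:
    """In-memory stand-in for a registry service (hypothetical API)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, metrics: Dict[str, float],
                 training_data_version: str) -> ModelVersion:
        # Version numbers increase by one per model name.
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, dict(metrics),
                             training_data_version)
        versions.append(entry)
        return entry

    def promote(self, name: str, version: int) -> None:
        # Mark one version as production and archive the previous one.
        versions = self._versions[name]
        if not any(v.version == version for v in versions):
            raise KeyError(f"{name} has no version {version}")
        for v in versions:
            if v.version == version:
                v.status = "production"
            elif v.status == "production":
                v.status = "archived"

    def production(self, name: str) -> Optional[ModelVersion]:
        # Answers "which model is in production?" for a given name.
        for v in self._versions.get(name, []):
            if v.status == "production":
                return v
        return None
```

Keeping the training data version on each entry is what lets the registry answer not only which model is live but how it got there.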

Feature store. Covered in Chapter 35. The feature store keeps training-time features and inference-time features consistent, which prevents training-serving skew, a problem that silently degrades model performance in production.

Training pipeline. Automated, reproducible training runs that start from versioned data, run versioned code, and produce versioned model artifacts. Reproducibility — the ability to re-run a past training run and get the same model — is the foundation of debugging and auditing.
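One way to make reproducibility checkable is to fingerprint each run from its versioned inputs and to seed every source of randomness. The sketch below is a toy under stated assumptions: run_fingerprint and train are hypothetical helpers, and a seeded sample mean stands in for real training.

```python
import hashlib
import json
import random


def run_fingerprint(data_version: str, code_version: str, config: dict) -> str:
    # Hash the three versioned inputs; sort_keys makes the hash
    # independent of dictionary key order.
    payload = json.dumps(
        {"data": data_version, "code": code_version, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def train(data: list, config: dict) -> dict:
    # Toy "training": a seeded random sample stands in for a real trainer.
    # All randomness flows from config["seed"], so reruns match.
    rng = random.Random(config["seed"])
    sample = rng.sample(data, config["sample_size"])
    return {"mean": sum(sample) / len(sample)}


config = {"seed": 7, "sample_size": 3}
data = [1.0, 2.0, 3.0, 4.0, 5.0]

first = train(data, config)
second = train(data, config)
print(first == second)  # same inputs and seed give the same artifact
print(run_fingerprint("data-v1", "commit-abc", config))
```

Storing the fingerprint alongside the model artifact gives auditors a single value to compare when they ask whether two runs used the same data, code, and configuration.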

Evaluation gate. Automated evaluation of model quality before deployment. The gate compares the candidate model's performance on the evaluation dataset against the current production model's performance. A candidate that does not beat (or at least match) the current model is not deployed.
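A minimal sketch of such a gate follows. It assumes every metric is "higher is better" and that both models were scored on the same evaluation dataset; the function name evaluation_gate and the min_delta parameter are illustrative choices, not part of any specific tool.

```python
from typing import Dict


def evaluation_gate(candidate: Dict[str, float],
                    production: Dict[str, float],
                    min_delta: float = 0.0) -> bool:
    """Return True only if the candidate matches or beats production.

    Assumes higher is better for every metric. min_delta is a
    hypothetical knob: 0.0 means "at least match", a positive value
    means "must beat by this margin".
    """
    for metric, prod_value in production.items():
        if metric not in candidate:
            return False  # a missing metric cannot be compared, so block
        if candidate[metric] < prod_value + min_delta:
            return False
    return True


current = {"auc": 0.81, "recall": 0.70}
print(evaluation_gate({"auc": 0.84, "recall": 0.72}, current))  # True
print(evaluation_gate({"auc": 0.84, "recall": 0.65}, current))  # False
```

Requiring every production metric to be present in the candidate's results avoids a silent pass when an evaluation step was skipped.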

Serving infrastructure. The infrastructure that makes trained models available for inference — model serving endpoints, load balancing, auto-scaling, caching, and request routing. Serving infrastructure must handle the latency requirements, throughput requirements, and availability requirements of the production use case.

Model monitoring. Continuous measurement of model quality and input distribution in production. Chapter 32 covers monitoring in detail. In the MLOps context, monitoring must feed back into the training pipeline — when monitoring detects degradation, it should trigger retraining.
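The feedback loop from monitoring to retraining can be sketched with a simple input-drift check. The example below uses the population stability index (PSI) as the drift measure, which is a choice made here for illustration rather than something this chapter prescribes. The 0.2 threshold is a common rule of thumb and should be treated as an assumption to tune; should_retrain is a hypothetical hook that a pipeline scheduler would call.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float],
                  edges: Sequence[float]) -> List[float]:
    # Fraction of values falling in each bin defined by the edges.
    # A small floor avoids taking the log of zero for empty bins.
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)
        counts[idx] += 1
    total = len(values)
    return [max(c / total, 1e-6) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float]) -> float:
    # PSI = sum over bins of (actual - expected) * ln(actual / expected).
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(expected: Sequence[float], actual: Sequence[float],
                   edges: Sequence[float], threshold: float = 0.2) -> bool:
    # Hypothetical trigger: drift above the threshold requests retraining.
    return psi(expected, actual, edges) > threshold


training_inputs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
live_inputs = [0.1, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
print(should_retrain(training_inputs, live_inputs, edges=[0.5]))
```

Input drift is only one signal. Where ground-truth labels arrive later, the same trigger can be driven by a drop in the measured quality metric instead.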

37.3 AI Governance Structures

MLOps handles the technical operability of AI systems. AI governance handles the organizational accountability — ensuring that AI systems are approved, monitored, and remediable by accountable humans.

AI inventory. A registry of all AI systems in production: who owns them, what they do, what data they use, what populations they affect, and when they were last reviewed. Organizations without an AI inventory often discover AI systems in unexpected places during regulatory inquiries.

Risk tiering. Not all AI systems carry the same risk. A recommendation engine for internal content carries different risk than an AI that makes credit decisions or clinical diagnoses. Risk tiering assigns governance requirements based on risk level — low-risk systems get periodic review; high-risk systems get continuous monitoring, explainability requirements, and human oversight for consequential outputs.
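Risk tiering can be encoded as data so that governance requirements are looked up rather than remembered. The sketch below is deliberately minimal: the two tiers and their requirements come from the paragraph above, while the single classification criterion (makes_consequential_decisions) and all names are assumptions for illustration. Real tiering policies usually weigh more factors.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class RiskTier(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class SystemProfile:
    name: str
    # Assumed criterion: does the system drive decisions such as
    # credit approval or clinical diagnosis?
    makes_consequential_decisions: bool


REQUIREMENTS: Dict[RiskTier, List[str]] = {
    RiskTier.LOW: ["periodic review"],
    RiskTier.HIGH: [
        "continuous monitoring",
        "explainability requirements",
        "human oversight for consequential outputs",
    ],
}


def assign_tier(profile: SystemProfile) -> RiskTier:
    if profile.makes_consequential_decisions:
        return RiskTier.HIGH
    return RiskTier.LOW


def requirements_for(profile: SystemProfile) -> List[str]:
    return REQUIREMENTS[assign_tier(profile)]


for system in (SystemProfile("internal-content-recs", False),
               SystemProfile("credit-decisioning", True)):
    print(system.name, assign_tier(system).value, requirements_for(system))
```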

Model approval process. A defined process by which new AI capabilities are approved for production — what reviews are required, who can approve, what documentation must be produced. Model approval processes prevent unauthorized AI deployment and create an audit trail for regulatory purposes.

Governance committee. A cross-functional body — legal, compliance, data privacy, business, and AI technical leads — that sets AI policy, reviews high-risk AI systems, and owns remediation decisions when AI systems cause harm. Governance committees without technical members set policies that cannot be implemented. Technical bodies without business and legal members miss the organizational and regulatory context.

37.4 Designing for Auditability

AI governance requires auditability: the ability to explain, after the fact, what data produced what model, what model produced what output, and what output affected what decision. Building auditability in from the start is dramatically cheaper than retrofitting it.

Input logging. Log every inference input — not just the output. For regulated AI systems, the input-output pair is the audit record. Systems that log outputs but not inputs cannot be audited.

Decision provenance. For AI systems that inform consequential decisions, log the complete chain: what inputs triggered what model, what model produced what output, what output was shown to what user, what decision the user made. Provenance chains are what regulators ask for when an AI-influenced decision is disputed.

Model version pinning. Record which model version produced each inference output. Without this, it is impossible to investigate past outputs — the model that produced a suspicious output six months ago may have been updated multiple times since.

Data lineage. Record which data sources and which versions of those data sources contributed to each model's training. Data lineage enables investigation of training data quality issues that manifest as model failures.
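The four practices above can meet in a single audit record written at inference time. The sketch below is one possible shape, not a standard schema: the field names, the log_inference helper, and the list used as a log sink are all hypothetical, and a production system would write to an append-only store.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def log_inference(sink: List[str], *, model_name: str, model_version: str,
                  data_versions: List[str], inputs: Dict[str, Any],
                  output: Any, shown_to_user: Optional[str] = None,
                  user_decision: Optional[str] = None) -> Dict[str, Any]:
    record = {
        "record_id": uuid.uuid4().hex,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": model_name,
        "model_version": model_version,         # model version pinning
        "training_data_versions": data_versions,  # data lineage
        "inputs": inputs,                        # input logging
        "output": output,
        "shown_to_user": shown_to_user,          # decision provenance
        "user_decision": user_decision,
    }
    # One JSON line per inference; the list stands in for a durable log.
    sink.append(json.dumps(record, sort_keys=True))
    return record


audit_log: List[str] = []
log_inference(
    audit_log,
    model_name="credit-risk",
    model_version="12",
    data_versions=["applications-2024-03"],
    inputs={"income": 52000, "tenure_months": 18},
    output={"score": 0.37},
    shown_to_user="analyst-041",
    user_decision="approved",
)
print(audit_log[0])
```

In practice the user's decision often arrives after the inference. A common alternative is to append a second event that carries the same record_id, which keeps the log append-only.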

Auditability is not just a compliance requirement — it is what allows engineering teams to debug AI failures effectively. The operational investment pays for itself the first time a production AI failure must be investigated under time pressure.
