Chapter 06 of 18

MLOps and AI Governance

Deploying AI models by hand, with no equivalent discipline to CI/CD, is how organizations accumulate expensive surprises. MLOps and AI governance are the practices that close that gap — from automated evaluation pipelines to the AI Register that tells you what is actually running in your enterprise.


Overview

Your CI/CD Pipeline Needs a Sibling

You would not dream of pushing code to production without automated builds, tests, and deployment pipelines. So here is the question worth losing sleep over: why are so many organizations still deploying AI models by hand, with no equivalent discipline?

The answer is that most teams have not yet internalized that AI models are production software — software that just happens to be written by data instead of developers. The discipline that fills this gap is called MLOps, and getting it right is one of the most important architectural decisions you will make in your AI journey. It is the difference between a one-off demo that wows the boardroom and a system that reliably creates value month after month, year after year.

MLOps — What It Actually Means

MLOps is the set of practices, tooling, and infrastructure that lets you deploy, monitor, and maintain AI models in production with the same confidence you have in your traditional software systems. Think of it as DevOps for machine learning: it borrows the same philosophy of automation, reproducibility, and continuous improvement, adapted for the unique challenges that come with statistical models and data dependencies.

The MLOps Lifecycle

The lifecycle is a loop. Unlike traditional software, where you deploy a release and move on to the next feature, ML models live in a continuous cycle. The world changes, the data shifts, and the model that was accurate last quarter may be dangerously wrong this quarter. The closest analogy is your SDLC, but with two additions: data is a first-class input that must be versioned, validated, and tracked just like source code, and model drift is a first-class risk that must be monitored and mitigated just like a security vulnerability.

MLOps Maturity Levels

Level	Description	What You Need
0 — Manual	Data scientists train models in notebooks, hand off artifacts	Almost nothing (and almost no reliability)
1 — Pipeline	Automated training pipeline, manual deployment	Pipeline orchestrator, model registry
2 — CI/CD	Automated testing, deployment, and rollback	Full MLOps platform, monitoring
3 — Continuous	Automated retraining on data drift, self-healing	Advanced monitoring, feature store, A/B testing

Most enterprises should target Level 1 or Level 2 as their near-term goal. Level 0 is where almost everyone starts, and there is no shame in that — but you should not stay there long. The risks of manual, ad-hoc model deployment compound quickly once you have more than one or two models in production. Level 3 is genuinely only necessary for high-frequency, high-stakes models like ad ranking systems or real-time fraud detection. If someone tells you every model needs Level 3 maturity, they are probably selling you a platform.

For GenAI: LLMOps

For teams working with generative AI — building applications on top of large language models — you are usually not training models yourself. You are calling an API. But that does not mean you can skip operational discipline. The operational surface area is different but equally demanding.

MLOps Concept	LLMOps Equivalent
Model training	Prompt engineering / fine-tuning
Model registry	Prompt registry / template versioning
Model evaluation	Prompt evaluation (automated + human)
Data drift monitoring	Input distribution monitoring
Model serving	LLM API management (gateway, caching)
A/B testing	Prompt A/B testing

Your prompts are your models. When you change a system prompt, you are changing the behavior of your application just as fundamentally as if you had retrained a neural network. That means prompts deserve the same rigor you would apply to any other production artifact: version control, testing, staged rollouts, and the ability to roll back when something goes wrong.
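To make that concrete, here is a minimal sketch of a prompt registry in Python. The `PromptRegistry` class, its method names, and the `support_bot` prompt are all hypothetical, invented for illustration; a real setup would more likely keep prompts in Git or a database. The core idea it shows is that versions are append-only, production points at exactly one of them, and rollback is just pointing back at an older one.

```python
import hashlib


class PromptRegistry:
    """In-memory store of append-only prompt versions (illustrative only)."""

    def __init__(self):
        self._versions = {}   # prompt name -> list of prompt texts
        self._active = {}     # prompt name -> version number served in production

    def register(self, name, text):
        """Store a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(text)
        return len(self._versions[name])

    def promote(self, name, version):
        """Point production at an existing version (also used for rollback)."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self._active[name] = version

    def active(self, name):
        """Return (version, text, short content hash) for the live prompt."""
        version = self._active[name]
        text = self._versions[name][version - 1]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        return version, text, digest


registry = PromptRegistry()
v1 = registry.register("support_bot", "You are a concise support assistant.")
v2 = registry.register("support_bot", "You are a friendly, concise support assistant.")

registry.promote("support_bot", v2)   # release the new prompt
registry.promote("support_bot", v1)   # something went wrong: roll back
print(registry.active("support_bot"))
```

The content hash is there so that logs and evaluation results can record exactly which prompt text produced a given response, even if someone later reuses a version label.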

The Model Registry

What It Is

If you have ever used a container registry like Docker Hub or ECR, you already understand the concept of a model registry. It is a versioned repository where you store trained models along with all the metadata needed to understand, reproduce, and deploy them. Just as you would not deploy a container without knowing which image tag you are running, you should not deploy a model without knowing exactly which version it is, what data it was trained on, and how it performed during evaluation.

What to Store

For each model version, your registry should capture a comprehensive record that tells the full story of how that model came to be and how it is performing. That means:

The model artifact itself — weights and configuration files
A reference to the exact training data used, including its version or content hash so you can reproduce the training run
Training hyperparameters — small changes in learning rate or batch size can have outsized effects on model behavior
Evaluation metrics — accuracy, precision, recall, latency, or whatever metrics matter for your use case; these are your deployment gates
A model card — a human-readable document explaining what the model does, what it was trained on, its known limitations, and its deployment status
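The list above maps naturally onto a single record per model version. The sketch below is one possible shape for that record, using only the Python standard library; the `ModelVersionRecord` fields, the `artifact_uri` path, and all the values are hypothetical placeholders, not the schema of MLflow or any other registry.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def content_hash(data: bytes) -> str:
    """SHA-256 of the training data bytes, used to pin the exact dataset."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelVersionRecord:
    name: str
    version: int
    artifact_uri: str          # where the weights and config files live
    training_data_hash: str    # pins the exact training data
    hyperparameters: dict      # e.g. learning rate, batch size
    metrics: dict              # evaluation results used as deployment gates
    model_card: str            # human-readable purpose and limitations


# Stand-in dataset; in practice you would hash the real training file.
training_csv = b"store_id,week,units\n1,1,120\n1,2,135\n"

record = ModelVersionRecord(
    name="demand_forecast",
    version=3,
    artifact_uri="models/demand_forecast/3/",   # hypothetical location
    training_data_hash=content_hash(training_csv),
    hyperparameters={"learning_rate": 0.05, "batch_size": 64},
    metrics={"accuracy": 0.91, "latency_p95_ms": 40},
    model_card="Weekly per-store demand forecast. Not validated for new stores.",
)

# Serialize the record so it can be stored next to the artifact.
print(json.dumps(asdict(record), indent=2))
```

Hashing the data rather than storing a file name means a silently overwritten dataset shows up as a hash mismatch instead of going unnoticed.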

Options

Registry	Type	Notes
MLflow	Open source	Most common, flexible
Vertex AI Model Registry	GCP managed	Integrated with Vertex
SageMaker Model Registry	AWS managed	Integrated with SageMaker
Weights & Biases	SaaS	Strong experiment tracking

Start with MLflow. It is open source, runs anywhere, and covers most of what teams need from a registry. You can migrate to a cloud-managed registry later if your platform strategy demands it, but MLflow gives you the flexibility to avoid lock-in while you are still figuring out your MLOps patterns. It is also widely adopted, so many data scientists already know how to use it.

Model Evaluation — Your AI Test Suite

Why It's Different

You cannot unit-test an AI model the way you test a function or an API endpoint. There is no assert statement that tells you whether a model is "correct," because models operate in a world of probabilities and trade-offs, not deterministic logic. Instead of testing individual predictions, you evaluate the model against a carefully curated test set and measure aggregate metrics that tell you whether the model's overall behavior meets your quality bar.

This requires different infrastructure: versioned test datasets, automated evaluation pipelines, and clear pass/fail thresholds agreed upon by both the data science team and the business stakeholders who will ultimately rely on the model's predictions.

Evaluation Framework

The critical piece is the pass/fail gate at the end. Without it, evaluation is just an academic exercise. With it, you have a deployment guardrail that prevents bad models from reaching production — the same way a failing test suite should prevent bad code from being merged.
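A pass/fail gate can be very small. The following is a sketch under assumed names: `evaluate_gate`, the metric names, and the threshold values are all hypothetical, and in practice the thresholds would be the ones your data science team and business stakeholders agreed on.

```python
def evaluate_gate(metrics, thresholds):
    """Return (passed, failures) for a candidate model.

    thresholds maps metric name -> ("min" or "max", limit).
    A metric that is missing from `metrics` counts as a failure.
    """
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return (not failures, failures)


# Hypothetical thresholds and candidate results.
thresholds = {
    "accuracy": ("min", 0.90),
    "latency_p95_ms": ("max", 200),
}
candidate = {"accuracy": 0.87, "latency_p95_ms": 150}

passed, failures = evaluate_gate(candidate, thresholds)
if not passed:
    # In a pipeline, this is where you would stop the deployment job.
    print("Blocked:", "; ".join(failures))
```

Treating a missing metric as a failure is a deliberate choice here: a gate that passes when the evaluation did not run is not a gate.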

Key Metrics

Metric	Use Case	What It Tells You
Accuracy	Classification	% correct predictions
Precision/Recall	When errors have different costs	False positive vs. false negative trade-off
BLEU/ROUGE	Text generation	How close to reference text
LLM-as-Judge	GenAI quality	Use a stronger model to evaluate a weaker one
Latency p95	All	Tail latency: 95% of requests complete at or under this time
Cost per request	All	Operational cost

Pay special attention to that last row. Cost per request may not feel like a quality metric, but in the world of LLM-powered applications, it is. A model that gives perfect answers but costs ten dollars per request is not a viable production system. Treat cost as a first-class constraint, right alongside accuracy and latency.
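The arithmetic behind cost per request is simple enough to put in a test. This sketch assumes token-based pricing quoted per 1,000 tokens; the prices, token counts, and request volume are made-up numbers for illustration, not any provider's actual rates.

```python
def cost_per_request(input_tokens, output_tokens,
                     usd_per_1k_input, usd_per_1k_output):
    """Cost of one LLM call under per-1,000-token pricing."""
    return ((input_tokens / 1000) * usd_per_1k_input
            + (output_tokens / 1000) * usd_per_1k_output)


def monthly_spend(cost, requests_per_day, days=30):
    """Project a per-request cost to a monthly bill."""
    return cost * requests_per_day * days


# Hypothetical workload: 2,000 prompt tokens and 500 completion tokens per call.
cost = cost_per_request(2000, 500, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
spend = monthly_spend(cost, requests_per_day=50_000)
print(f"${cost:.4f} per request, about ${spend:,.0f} per month")
```

A few cents per call looks harmless until it is multiplied by traffic, which is why the monthly projection belongs next to the per-request figure.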

Evaluation for GenAI

For LLM-based systems, evaluation requires a layered approach combining automated checks with human judgment.

Factuality is your first line of defense: does the response match known facts? For RAG-based systems, you can automate this by comparing the model's output against the source documents it was supposed to draw from. Relevance checks whether the response actually answers the user's question — even a factually correct response is useless if it is off-topic. Safety evaluation checks whether the response contains harmful, offensive, or inappropriate content. Format compliance ensures the response conforms to expected structure — when your application expects JSON conforming to a specific schema, a beautifully written response that breaks the schema is a production incident. Hallucination rate tracking measures how often the model invents facts not present in the source material, which is the single biggest trust risk in any GenAI deployment.
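Of the checks above, format compliance is the easiest to automate. Here is a minimal sketch using only the standard library; `check_format`, `compliance_rate`, and the `answer`/`sources` fields are hypothetical, and a production system would more likely validate against a full JSON Schema.

```python
import json


def check_format(raw, required):
    """Return a list of problems; an empty list means the response complies.

    `required` maps field name -> expected Python type.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, expected_type in required.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(f"wrong type for {key}")
    return problems


def compliance_rate(responses, required):
    """Share of responses with no format problems."""
    compliant = sum(1 for raw in responses if not check_format(raw, required))
    return compliant / len(responses)


schema = {"answer": str, "sources": list}   # hypothetical expected fields
responses = [
    '{"answer": "Restart the router.", "sources": ["kb-101"]}',
    "Sure! Here is the answer: restart the router.",   # fluent but not JSON
]
print(compliance_rate(responses, schema))
```

Tracking the rate over a fixed evaluation set, rather than checking single responses, is what lets you compare two prompt versions on the same footing.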

Monitoring in Production

What to Monitor

Deploying a model is the starting line, not the finish line. A model in production is a living system that can degrade, drift, and fail in ways far more subtle than a crashed server or a thrown exception.

Category	Metrics	Alert When
Performance	Latency, throughput, error rate	SLA breach
Quality	Accuracy on shadow labels, user feedback	Quality drops below threshold
Cost	Tokens per request, cost per request, daily spend	Budget threshold exceeded
Data	Input distribution shift, new categories appearing	Distribution drift detected
Safety	Blocked requests, flagged outputs	Spike in safety violations

These categories are interconnected. A shift in input distribution often leads to a drop in quality — but not always immediately. There can be a lag of days or weeks before degraded inputs produce visibly degraded outputs. This is why monitoring all five dimensions is essential. If you only watch latency and error rates, you will catch infrastructure problems but miss the slow, silent quality degradation that is the hallmark of model drift.
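For the Performance row, it helps to be precise about what a latency percentile measures. The sketch below uses the nearest-rank method, which is one of several percentile definitions, on made-up latencies; `p95` and `sla_breached` are hypothetical helper names, and the 500 ms SLA is an assumption for illustration.

```python
def p95(latencies_ms):
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def sla_breached(latencies_ms, sla_ms):
    """Alert condition: tail latency above the agreed SLA."""
    return p95(latencies_ms) > sla_ms


# Made-up sample: 19 fast requests and one very slow outlier.
latencies_ms = [100] * 19 + [900]

print("p95:", p95(latencies_ms))        # tail latency
print("max:", max(latencies_ms))        # true worst case
print("breach:", sla_breached(latencies_ms, sla_ms=500))
```

The single slow request dominates the maximum but does not move the p95, which is why percentile-based alerts are less noisy than alerts on the worst observed request.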

Model Drift

Drift is the failure mode that catches most teams off guard. Models degrade over time, not because anything is wrong with the model itself, but because the world changes. Customer behavior shifts with the seasons, the economy, cultural trends. New products launch and create categories that did not exist when the model was trained. Regulations change and alter what is permissible.

Detecting drift requires comparing current input and output distributions against the distributions the model saw during training. Statistical measures such as KL divergence and the Population Stability Index (PSI) can automate this comparison and fire alerts when distributions diverge beyond acceptable thresholds. Set these thresholds thoughtfully: too sensitive and you will drown in false alarms; too loose and you will miss real degradation until a business stakeholder calls asking why predictions have gone haywire.
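As an illustration, here is a small Population Stability Index calculation for one numeric feature. It bins the training-time values, measures what fraction of training and current values fall in each bin, and sums (actual - expected) * ln(actual / expected) across bins. The function name, the equal-width binning, and the sample data are assumptions for this sketch. The 0.25 alert level is a commonly cited rule of thumb rather than a universal standard, so tune it for your own data.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)   # clamp outliers into edge bins
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up data: a uniform baseline and the same values shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

ALERT_THRESHOLD = 0.25   # rule-of-thumb value, an assumption here
score = psi(baseline, shifted)
if score > ALERT_THRESHOLD:
    print(f"Drift alert: PSI={score:.2f}")
```

The bin edges come from the baseline on purpose: the question is how far current data has moved relative to what the model was trained on, not the other way around.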

Once you detect drift, the response depends on your architecture. For traditional ML models, the answer is usually to retrain on more recent data. For RAG-based GenAI systems, drift often manifests as stale knowledge, and the fix is to update your knowledge base with current documents rather than retraining the underlying language model.

AI Governance Framework

Governance Structure

AI governance done well is not bureaucratic theater. The risks of ungoverned AI are financial, legal, and reputational — and they scale with the number of AI systems you have in production.

A practical AI governance framework has three layers: policies that define the rules, standards that make those rules specific and actionable, and implementation that embeds those standards into your actual infrastructure and workflows.

The AI Governance Board should include your CTO or VP of Engineering, Chief Data Officer, representatives from Legal and Ethics, and your Enterprise Architecture lead. This is not a committee that meets once a quarter to rubber-stamp decisions. It is an active body that reviews high-risk AI deployments, sets policy, and owns the incident response playbook when something goes wrong. The EA lead's role on this board is particularly important — you are the one who sees the full portfolio of AI systems across the enterprise and can spot patterns, redundancies, and risks that individual teams miss.

Risk Classification

Not every model needs the same level of oversight. An internal tool that summarizes meeting notes carries a fundamentally different risk profile than a system that makes lending decisions or provides medical advice.

Risk Level	Examples	Governance Required
Low	Internal content summary, search	Model card, basic monitoring
Medium	Customer-facing chatbot, recommendations	+ evaluation suite, human review, A/B testing
High	Credit decisions, medical, legal	+ bias testing, explainability, regulatory review
Critical	Autonomous actions, safety-critical	+ formal verification, continuous audit, HITL

A tiered approach lets you move fast where the risk is low while applying rigorous oversight where the stakes are high. Low-risk internal tools can be deployed with a model card and basic monitoring. High-risk systems — anything touching financial decisions, health outcomes, or legal matters — should go through bias testing, explainability review, and regulatory compliance checks before they see a single production request. This is not about slowing innovation. It is about concentrating your governance energy where it matters most.
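Because each tier adds to the one below it, the table translates directly into a cumulative lookup. This sketch encodes the table's controls as data; the function names and the sample system are hypothetical, and the point is only that a deployment check can compute what is still missing for a given risk level.

```python
TIERS = ["low", "medium", "high", "critical"]

# Controls each tier adds on top of the tiers below it (from the table above).
ADDED_CONTROLS = {
    "low": ["model card", "basic monitoring"],
    "medium": ["evaluation suite", "human review", "A/B testing"],
    "high": ["bias testing", "explainability", "regulatory review"],
    "critical": ["formal verification", "continuous audit", "human-in-the-loop"],
}


def required_controls(tier):
    """All controls required at a tier, including those inherited from lower tiers."""
    if tier not in TIERS:
        raise ValueError(f"unknown risk tier: {tier!r}")
    controls = []
    for level in TIERS[: TIERS.index(tier) + 1]:
        controls.extend(ADDED_CONTROLS[level])
    return controls


def missing_controls(tier, completed):
    """Controls still outstanding before a system at this tier may deploy."""
    return [c for c in required_controls(tier) if c not in completed]


# Hypothetical high-risk system that has only done the basics so far.
done = {"model card", "basic monitoring", "evaluation suite"}
print(missing_controls("high", done))
```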

The AI Register

Do you actually know every AI system running in your enterprise? If the answer is "not really," you are not alone, but you should be alarmed. Shadow AI (models and LLM-powered tools deployed by individual teams without central visibility) is the new shadow IT. It carries the same risks of data leakage, compliance violations, and architectural sprawl.

The solution is an AI Register: a central, maintained catalog of every AI system in your organization. For each system, capture: the system name and its business owner (every AI system needs a human who is accountable for its behavior), the risk classification, the data sources the system uses, model details including provider and version, the last evaluation date and results, known limitations or failure modes, and incident history.

This is essentially your enterprise application portfolio — the same artifact enterprise architects have maintained for decades — extended to cover AI components. Treat it as a living document. If a system is not in the register, it should not be in production.
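A register entry can start as a plain data record. The sketch below is one possible shape, with field names taken from the list above; the `RegisterEntry` class, the two helper functions, the 90-day review window, and the sample systems are all hypothetical. It also shows the rule in code: anything deployed but not registered gets flagged.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RegisterEntry:
    system_name: str
    business_owner: str        # the accountable human
    risk_level: str
    data_sources: list
    model_provider: str
    model_version: str
    last_evaluated: date
    known_limitations: list = field(default_factory=list)
    incidents: list = field(default_factory=list)


def unregistered(deployed_names, register):
    """Systems found in production that have no register entry (shadow AI)."""
    return sorted(set(deployed_names) - set(register))


def overdue(register, today, max_age_days=90):
    """Registered systems whose last evaluation is older than the allowed window."""
    return sorted(
        name for name, entry in register.items()
        if (today - entry.last_evaluated).days > max_age_days
    )


# Hypothetical register with a single entry.
register = {
    "demand_forecast": RegisterEntry(
        system_name="demand_forecast",
        business_owner="VP Supply Chain",
        risk_level="medium",
        data_sources=["pos_sales", "promotions_calendar"],
        model_provider="in-house",
        model_version="3",
        last_evaluated=date(2024, 1, 15),
    ),
}
deployed = ["demand_forecast", "sales_email_drafter"]

print("Not in register:", unregistered(deployed, register))
print("Evaluation overdue:", overdue(register, today=date(2024, 6, 1)))
```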

Real-World Example: The Retailer's MLOps Journey

A major retailer with two hundred stores deployed AI for demand forecasting — predicting how much of each product to stock in each store each week. A classic, high-value ML use case.

In the first three months, data scientists built models in Jupyter notebooks, worked with whatever data was convenient, and deployed models by manually copying artifacts to a production server. There was no monitoring, no evaluation gates, and no governance. The models worked well enough during normal periods, but when the holiday season arrived, the forecasting model predicted demand based on 2019 (pre-COVID) patterns, because nobody had noticed that the training data had not been updated to reflect the massive shifts in consumer behavior since then. The result was two million dollars in overstock losses. Two million dollars, because nobody had a drift detection system that would have flagged the stale training data.

During months four through six, the team built a proper MLOps pipeline. They automated training to pull in recent data, added drift detection to flag when input distributions shifted meaningfully, and set the model to retrain monthly rather than whenever someone remembered to do it. This eliminated the category of failure that had caused the overstock incident.

In months seven through nine, they added evaluation gates. No model could be deployed to production without passing accuracy thresholds on a regularly refreshed held-out test set. They introduced A/B testing, where new models would serve only ten percent of forecasting requests initially and only graduate to full traffic once they had proven themselves against the incumbent. This gave them confidence that model updates were genuine improvements, not regressions dressed up in better training metrics.
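The source does not say how the retailer split its traffic, so the following is only one common way to do it: hash a stable request key into a bucket from 0 to 99 and send the lowest buckets to the new model. The function name, the `store-N` keys, and the choice of SHA-256 are assumptions made for this sketch.

```python
import hashlib


def assign_variant(request_key, challenger_percent=10):
    """Deterministically route a request to the challenger or the incumbent.

    The same key always lands in the same bucket, so a given store sees
    consistent forecasts for the duration of the experiment.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "challenger" if bucket < challenger_percent else "incumbent"


# Hypothetical request keys; roughly ten percent should reach the challenger.
keys = [f"store-{i}" for i in range(10_000)]
share = sum(assign_variant(k) == "challenger" for k in keys) / len(keys)
print(f"challenger share: {share:.1%}")
```

Graduating the new model to full traffic is then a configuration change to `challenger_percent`, not a redeployment.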

By months ten through twelve, they established governance around the entire process. Each forecasting model had a model card. Monthly reviews with business stakeholders ensured alignment with changing priorities. An incident response playbook defined exactly what to do when predictions were significantly off — who to notify, how to diagnose root cause, how to roll back to a previous model version.

The total cost of the MLOps platform was about two hundred thousand dollars per year. The cost of the single overstock incident it would have prevented was two million dollars. The math is not subtle.

Companion Notebook

Build a simple MLOps pipeline: train a model, log to MLflow, evaluate against a test set, register the model, and simulate a deployment decision gate.
