Quick Reference 14

MLOps for GenAI

Quick reference for CI/CD for ML, model versioning, A/B testing, canary deployment, drift detection, and monitoring.


MLOps vs Traditional DevOps

MLOps extends DevOps with ML-specific concerns: model versioning, data validation, and drift detection. GenAI adds yet another layer -- prompt versioning, output quality monitoring, and cost tracking. Understanding these differences prevents you from applying the wrong operational playbook.

| Aspect | DevOps | MLOps | GenAI Ops |
|---|---|---|---|
| Artifact | Code | Code + Model + Data | Code + Prompts + Configs |
| Testing | Unit/integration | + Data validation, model eval | + Prompt eval, safety tests |
| Versioning | Code (git) | + Model registry, data versions | + Prompt versions, eval sets |
| Monitoring | Uptime, latency | + Model performance, drift | + Output quality, cost, safety |
| Rollback | Deploy previous code | + Rollback model version | + Rollback prompt version |
| CI trigger | Code commit | + Data change, schedule | + Prompt change, eval regression |

CI/CD Pipeline for ML

Manual model training and deployment is a common bottleneck for ML teams. A CI/CD pipeline automates the path from code commit to production deployment, with quality gates that stop models that fail evaluation from reaching users.

┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
│   Code     │    │   Build &  │    │   Model    │    │   Deploy   │
│   Commit   │───>│   Test     │───>│   Train    │───>│   & Serve  │
│            │    │            │    │   & Eval   │    │            │
└────────────┘    └────────────┘    └────────────┘    └────────────┘
     │                 │                  │                  │
     ▼                 ▼                  ▼                  ▼
  Git push       - Lint code        - Train model      - Canary deploy
  PR review      - Unit tests       - Evaluate          - A/B test
                 - Data validation  - Compare baseline  - Monitor
                 - Schema check     - Bias audit        - Auto-rollback
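The "Compare baseline" gate in the diagram can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the function name, the metric keys, and both thresholds (`min_accuracy`, `max_regression`) are assumptions, and it treats every metric as higher-is-better.

```python
def passes_quality_gate(candidate, baseline, min_accuracy=0.92, max_regression=0.01):
    """Return (ok, reasons) for a candidate model's metrics.

    candidate / baseline: dicts of metric name -> value (higher is better).
    Thresholds are illustrative defaults, not recommendations.
    """
    reasons = []
    # Absolute floor: the candidate must clear a minimum accuracy.
    if candidate["accuracy"] < min_accuracy:
        reasons.append(f"accuracy {candidate['accuracy']:.3f} < floor {min_accuracy}")
    # Relative check: no metric may regress against the baseline by more than max_regression.
    for name, base_value in baseline.items():
        drop = base_value - candidate.get(name, 0.0)
        if drop > max_regression:
            reasons.append(f"{name} regressed by {drop:.3f}")
    return (not reasons, reasons)


ok, reasons = passes_quality_gate(
    candidate={"accuracy": 0.94, "f1": 0.91},
    baseline={"accuracy": 0.93, "f1": 0.90},
)
```

In CI, a script would typically exit non-zero when `ok` is false so that later jobs do not run.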

CI Pipeline Steps

| Step | Tool | Purpose |
|---|---|---|
| Code lint | ruff, black, mypy | Code quality |
| Unit tests | pytest | Logic correctness |
| Data validation | Great Expectations, Pydantic | Schema, distributions, nulls |
| Training | Vertex AI, SageMaker, custom | Model fitting |
| Model evaluation | Custom metrics + framework | Quality gate |
| Bias check | Fairlearn, Aequitas | Fairness gate |
| Security scan | Trivy, Snyk | Container/dependency vulnerabilities |
| Prompt eval (GenAI) | Promptfoo, custom | Prompt quality gate |
| Integration test | Custom | End-to-end pipeline validation |
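To make the data validation step concrete, here is a dependency-free sketch of the checks it usually covers (column presence, types, null rate). In practice a library such as Great Expectations or Pydantic would do this work; `validate_rows`, `SCHEMA`, and the 5% null-rate limit are hypothetical names and values chosen for illustration.

```python
def validate_rows(rows, schema, max_null_rate=0.05):
    """Check types and null rates per column. Returns a list of error strings."""
    if not rows:
        return ["dataset is empty"]
    errors = []
    for column, expected_type in schema.items():
        # Missing keys are treated the same as explicit nulls.
        values = [row.get(column) for row in rows]
        null_rate = sum(v is None for v in values) / len(rows)
        if null_rate > max_null_rate:
            errors.append(f"{column}: null rate {null_rate:.0%} exceeds {max_null_rate:.0%}")
        bad = [v for v in values if v is not None and not isinstance(v, expected_type)]
        if bad:
            errors.append(f"{column}: {len(bad)} value(s) not of type {expected_type.__name__}")
    return errors


SCHEMA = {"amount": float, "country": str}  # hypothetical schema
rows = [
    {"amount": 12.5, "country": "DE"},
    {"amount": None, "country": "US"},   # null value
    {"amount": "7.0", "country": "FR"},  # wrong type: string instead of float
]
errors = validate_rows(rows, SCHEMA)
```

Here `errors` contains two entries for `amount` (null rate and type), and none for `country`.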

Example GitHub Actions ML Pipeline

name: ML Pipeline
on:
  push:
    branches: [main]
    paths: ['model/**', 'data/**', 'prompts/**']

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: pytest tests/ -v
      - name: Validate data schema
        run: python scripts/validate_data.py
      - name: Run prompt evaluations
        run: python scripts/eval_prompts.py --threshold 0.85

  train:
    needs: test
    runs-on: ubuntu-latest
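    steps:
      # Each job runs on a fresh runner, so the repo must be checked out again
      - uses: actions/checkout@v4
      - name: Train model
        run: python scripts/train.py --config configs/prod.yaml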
      - name: Evaluate model
        run: python scripts/evaluate.py --min-accuracy 0.92
      - name: Register model
        run: python scripts/register_model.py --tag candidate

  deploy:
    needs: train
    runs-on: ubuntu-latest
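    steps:
      # Each job runs on a fresh runner, so the repo must be checked out again
      - uses: actions/checkout@v4
      - name: Canary deploy (10%)
        run: python scripts/deploy.py --traffic 10 --tag candidate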
      - name: Run smoke tests
        run: python scripts/smoke_test.py
      - name: Promote to full traffic
        run: python scripts/deploy.py --traffic 100 --tag candidate

Model Versioning

If you cannot reproduce a model from six months ago, you cannot debug it, audit it, or roll back to it. Version everything -- code, data, weights, hyperparameters, and prompts.

What to Version

| Artifact | Tool | Why |
|---|---|---|
| Code | Git | Reproducibility |
| Training data | DVC, Delta Lake, GCS versioned buckets | Data lineage |
| Model weights | MLflow, Vertex AI Model Registry, W&B | Rollback, comparison |
| Hyperparameters | MLflow, config files in git | Reproducibility |
| Evaluation results | MLflow, database | Quality tracking |
| Prompts (GenAI) | Git, prompt management tool | Version + eval tracking |
| Eval datasets | Git, DVC | Consistent benchmarking |
| Environment | Docker image tags | Reproducibility |
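One lightweight way to tie these artifacts together is a manifest that records a fingerprint of each input to a training run. The sketch below uses only the standard library; `build_manifest`, its field names, and the sample values are illustrative assumptions, and a real setup would more likely rely on the tools in the table.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Content hash: identical bytes always produce the same fingerprint."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(git_commit, data_bytes, hyperparams, image_tag):
    # sort_keys makes the hash independent of dict ordering.
    params_bytes = json.dumps(hyperparams, sort_keys=True).encode()
    return {
        "git_commit": git_commit,
        "data_sha256": fingerprint(data_bytes),
        "params_sha256": fingerprint(params_bytes),
        "image_tag": image_tag,
    }


manifest = build_manifest(
    git_commit="abc1234",                  # hypothetical commit
    data_bytes=b"id,amount\n1,12.5\n",     # stand-in for the training file contents
    hyperparams={"lr": 0.01, "epochs": 10},
    image_tag="trainer:1.4.2",             # hypothetical image tag
)
```

Storing the manifest next to the model weights means a run from six months ago can be matched to the exact code, data, and configuration that produced it.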

MLflow Model Registry

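import mlflow
from mlflow.tracking import MlflowClient

# `model` is assumed to be an already-fitted scikit-learn estimator.

# Log model during training
with mlflow.start_run():
    mlflow.log_params({"lr": 0.01, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.94, "f1": 0.91})
    mlflow.sklearn.log_model(model, "model", registered_model_name="fraud-detector")

# Promote model version.
# Note: model stages are deprecated in recent MLflow releases (2.9+) in favor of
# aliases, e.g. client.set_registered_model_alias("fraud-detector", "production", 3).
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production"
)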

Prompt Versioning (GenAI-Specific)

In GenAI systems, prompts are code -- a one-word change can dramatically alter output quality. Version prompts in git alongside their evaluation datasets so every change is tracked, testable, and reversible.

Prompt Management Pattern

prompts/
  ├── summarize/
  │   ├── v1.yaml
  │   ├── v2.yaml         # current production
  │   ├── v3.yaml         # candidate
  │   └── eval_set.jsonl  # evaluation dataset
  ├── classify/
  │   ├── v1.yaml
  │   └── eval_set.jsonl
  └── config.yaml          # which version is active per env
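Given this layout, resolving which prompt file is active in an environment is a small lookup. The sketch below assumes `config.yaml` parses into a mapping of environment to prompt name to version number; that shape, the `ACTIVE_VERSIONS` dict, and `prompt_path` are assumptions, since the actual file format is not shown.

```python
from pathlib import PurePosixPath

# Assumed shape of prompts/config.yaml after parsing (hypothetical).
ACTIVE_VERSIONS = {
    "production": {"summarize": 2, "classify": 1},
    "staging": {"summarize": 3, "classify": 1},
}


def prompt_path(name, env, config=ACTIVE_VERSIONS, root="prompts"):
    """Return the path of the prompt file that is active for `name` in `env`."""
    try:
        version = config[env][name]
    except KeyError as exc:
        raise KeyError(f"No active version for prompt {name!r} in env {env!r}") from exc
    return PurePosixPath(root) / name / f"v{version}.yaml"


path = prompt_path("summarize", "production")
```

With this arrangement, rolling back a prompt is a one-line change to the config rather than an edit to the prompt file itself.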

Prompt Config Example

# prompts/summarize/v2.yaml
name: summarize
version: 2
model: claude-sonnet-4
temperature: 0.3
max_tokens: 500
system: |
  You are a concise summarizer. Extract key points from the given text.
  Return exactly 3 bullet points, each under 20 words.
template: |
  Summarize the following text:

  {text}
eval_criteria:
  - name: bullet_count
    type: format_check
    expected: 3
  - name: relevance
    type: llm_judge
    threshold: 0.8
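A `format_check` criterion like `bullet_count` can be evaluated without calling a model. The following is one possible sketch: `check_bullets` and the set of accepted bullet markers are assumptions, and "under 20 words" is interpreted as strictly fewer than 20.

```python
import re

# Matches lines that start with -, *, •, "1." or "1)" followed by text.
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*\S)\s*$")


def check_bullets(response, expected=3, max_words=20):
    """True if the response has exactly `expected` bullets, each under `max_words` words."""
    bullets = [m.group(1) for line in response.splitlines() if (m := BULLET.match(line))]
    return len(bullets) == expected and all(len(b.split()) < max_words for b in bullets)


good = "- Revenue grew in Q3\n- Costs fell slightly\n- Outlook remains stable"
too_few = "- Revenue grew in Q3\n- Costs fell slightly"
```

`check_bullets(good)` passes, while `check_bullets(too_few)` fails the count check. The `llm_judge` criterion needs a model call and is not covered here.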

A/B Testing

A/B testing is the most direct way to measure whether a new model actually improves business outcomes in production. Without it, you are deploying on the strength of offline metrics that may not correlate with real-world performance.

A/B Test Architecture

              Load Balancer
              (traffic split)
             /              \
       90% traffic      10% traffic
            │                │
     ┌──────▼──────┐  ┌─────▼───────┐
     │  Model A    │  │  Model B    │
     │ (champion)  │  │ (candidate) │
     └──────┬──────┘  └──────┬──────┘
            │                │
            └──────┬─────────┘
                   ▼
            Metrics Collection
            & Statistical Test

A/B Test Checklist

| Step | Action |
|---|---|
| 1. Define hypothesis | "Model B improves metric X by Y%" |
| 2. Choose metrics | Primary (business), secondary (model quality), guardrail (safety) |
| 3. Calculate sample size | Power analysis: significance level, power, minimum detectable effect |
| 4. Randomize traffic | Consistent hashing by user/session |
| 5. Run experiment | Collect data for calculated duration |
| 6. Analyze results | Statistical significance test |
| 7. Decide | Promote, iterate, or discard |
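Step 3 can be sketched with the standard normal-approximation formula for comparing two proportions. The function name and defaults are assumptions, and `min_detectable_lift` is treated here as a relative lift over the baseline rate.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)  # relative lift
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance level
    z_beta = NormalDist().inv_cdf(power)            # statistical power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


n = sample_size_per_arm(baseline_rate=0.10, min_detectable_lift=0.10)
```

Detecting a 10% relative lift on a 10% baseline needs on the order of fifteen thousand users per arm, which is why small effects on low-traffic products can take weeks to measure.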

Statistical Significance

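from scipy import stats

# Two-proportion z-test (for conversion rates)
def ab_test_proportions(successes_a, total_a, successes_b, total_b, alpha=0.05):
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    p_pool = (successes_a + successes_b) / (total_a + total_b)

    # Pooled standard error under the null hypothesis p_a == p_b
    se = (p_pool * (1 - p_pool) * (1/total_a + 1/total_b)) ** 0.5
    if se == 0:
        raise ValueError("Pooled rate is 0 or 1; the z-test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value; sf (= 1 - cdf) keeps precision for large |z|
    p_value = 2 * stats.norm.sf(abs(z))

    return {
        "rate_a": p_a, "rate_b": p_b,
        # Relative lift is undefined when the baseline rate is zero
        "lift": (p_b - p_a) / p_a if p_a > 0 else None,
        "z_score": z, "p_value": p_value,
        "significant": p_value < alpha
    }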

Canary Deployment

Canary deployments let you validate a new model on a small percentage of live traffic before rolling it out fully. Combined with auto-rollback triggers, they prevent bad deployments from impacting more than a fraction of users.

Canary Stages

| Stage | Traffic % | Duration | Gate |
|---|---|---|---|
| Deploy | 0% (shadow) | 1 hour | Logs clean, no errors |
| Canary | 5% | 2-4 hours | Error rate < baseline + 0.5% |
| Expand | 25% | 4-12 hours | Latency P95 < baseline + 10% |
| Expand | 50% | 12-24 hours | Business metrics stable |
| Full | 100% | Ongoing | All metrics healthy |

Auto-Rollback Triggers

| Signal | Threshold | Action |
|---|---|---|
| Error rate | > 5% | Immediate rollback |
| Latency P95 | > 2x baseline | Rollback |
| Guardrail trigger rate | > 3x baseline | Rollback |
| User feedback negative | > 2x baseline | Pause, investigate |
| Cost per request | > 2x baseline | Pause, investigate |
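The trigger table translates directly into a decision function. This sketch uses the thresholds listed above; the function name and the metric keys (`error_rate`, `latency_p95_ms`, and so on) are assumptions about how metrics might be named in your system.

```python
def rollback_action(current, baseline):
    """Map canary metrics to 'rollback', 'pause', or 'continue'."""
    # Rollback conditions are checked first because they are the most severe.
    if current["error_rate"] > 0.05:
        return "rollback"
    if current["latency_p95_ms"] > 2 * baseline["latency_p95_ms"]:
        return "rollback"
    if current["guardrail_rate"] > 3 * baseline["guardrail_rate"]:
        return "rollback"
    # Softer signals pause the rollout for a human to investigate.
    if current["negative_feedback_rate"] > 2 * baseline["negative_feedback_rate"]:
        return "pause"
    if current["cost_per_request"] > 2 * baseline["cost_per_request"]:
        return "pause"
    return "continue"


baseline = {
    "latency_p95_ms": 800, "guardrail_rate": 0.01,
    "negative_feedback_rate": 0.05, "cost_per_request": 0.002,
}
healthy = {"error_rate": 0.01, **baseline}
action = rollback_action(healthy, baseline)
```

A deployment controller would call this on a schedule during each canary stage and act on the first non-"continue" result.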

Drift Detection

The real world changes constantly -- customer behavior shifts, data pipelines break, and upstream schemas evolve. Drift detection alerts you when your model's operating conditions have diverged from what it was trained on.

Types of Drift

| Type | What Changes | Detection Method |
|---|---|---|
| Data drift | Input feature distributions | PSI, KL divergence, KS test |
| Concept drift | Relationship between features and target | Model performance metrics |
| Prediction drift | Output distribution | PSI on predictions |
| Label drift | Ground truth distribution | Monitor label feedback |
| Upstream drift | Data pipeline changes | Schema validation |
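Concept drift cannot be seen in the inputs alone; it shows up when ground-truth labels arrive and performance drops. A rolling-window accuracy monitor is one simple way to watch for it. The class name, window size, and `max_drop` threshold below are illustrative assumptions.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent labeled predictions."""

    def __init__(self, baseline, window=500, max_drop=0.05):
        self.baseline = baseline              # accuracy measured at deployment time
        self.max_drop = max_drop              # tolerated absolute drop
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically

    def record(self, prediction, label):
        self.outcomes.append(prediction == label)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def drifted(self):
        acc = self.accuracy()
        return acc is not None and self.baseline - acc > self.max_drop


monitor = RollingAccuracy(baseline=0.90, window=10)
for _ in range(10):
    monitor.record(prediction=1, label=1)  # ten correct predictions
```

After these ten correct predictions `monitor.drifted()` is false; it flips to true once enough recent predictions are wrong to pull the window's accuracy more than `max_drop` below the baseline.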

Population Stability Index (PSI)

import numpy as np

def calculate_psi(expected, actual, bins=10):
    """
    PSI < 0.1: No significant change
    PSI 0.1-0.25: Moderate change, investigate
    PSI > 0.25: Significant change, action needed
    """
    breakpoints = np.quantile(expected, np.linspace(0, 1, bins + 1))
    breakpoints[0] = -np.inf
    breakpoints[-1] = np.inf

    expected_counts = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    actual_counts = np.histogram(actual, bins=breakpoints)[0] / len(actual)

    # Avoid division by zero
    expected_counts = np.clip(expected_counts, 0.001, None)
    actual_counts = np.clip(actual_counts, 0.001, None)

    psi = np.sum((actual_counts - expected_counts) * np.log(actual_counts / expected_counts))
    return psi

KL Divergence

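import numpy as np
from scipy.special import rel_entr
from scipy.spatial.distance import jensenshannon

def kl_divergence(p, q):
    """
    KL(P || Q): How much P differs from Q.
    Not symmetric: KL(P||Q) != KL(Q||P)
    p and q must be probability vectors over the same bins; the result is
    infinite wherever q is 0 and p is not.
    """
    return float(np.sum(rel_entr(p, q)))

# Jensen-Shannon (symmetric version).
# Note: scipy's jensenshannon returns the JS *distance*, i.e. the square root
# of the JS divergence. Square it if you need the divergence itself.
js_dist = jensenshannon(distribution_train, distribution_prod)
js_div = js_dist ** 2
# Thresholds on the returned distance: < 0.1: similar, > 0.3: investigate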

Kolmogorov-Smirnov Test

from scipy.stats import ks_2samp

stat, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.05:
    print(f"Significant drift detected (KS stat: {stat:.3f}, p: {p_value:.4f})")

Monitoring Metrics

Monitoring is your early warning system for every type of production failure -- performance degradation, cost spikes, safety violations, and user dissatisfaction. GenAI systems require additional metrics beyond traditional ML.

Traditional ML Monitoring

| Category | Metric | Alert Condition |
|---|---|---|
| Performance | Accuracy, F1, AUC | Drop > 5% from baseline |
| Latency | P50, P95, P99 | P95 > SLA threshold |
| Throughput | Requests/sec | > 80% capacity |
| Error rate | Failed predictions | > 1% |
| Data quality | % null, schema violations | Any violation |
| Feature drift | PSI per feature | PSI > 0.25 |
| Prediction drift | PSI on outputs | PSI > 0.2 |
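A few of these alert conditions can be evaluated with very little code. The sketch below covers the latency, error-rate, and feature-drift rows using a nearest-rank percentile; `percentile`, `check_alerts`, and the alert names are assumptions, and a production system would normally express such rules in its metrics store instead.

```python
from math import ceil


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(latencies_ms, sla_ms, error_rate, feature_psi):
    alerts = []
    if percentile(latencies_ms, 95) > sla_ms:   # P95 > SLA threshold
        alerts.append("latency_p95")
    if error_rate > 0.01:                       # error rate > 1%
        alerts.append("error_rate")
    # PSI > 0.25 on any individual feature
    alerts += [f"feature_drift:{name}" for name, psi in sorted(feature_psi.items()) if psi > 0.25]
    return alerts


alerts = check_alerts(
    latencies_ms=list(range(1, 101)),  # synthetic latencies: 1..100 ms
    sla_ms=90,
    error_rate=0.002,
    feature_psi={"age": 0.30, "income": 0.05},
)
```

With this synthetic input the P95 is 95 ms, so `alerts` is `["latency_p95", "feature_drift:age"]`.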

GenAI-Specific Monitoring

| Category | Metric | Alert Condition |
|---|---|---|
| Quality | LLM-as-judge score (sampled) | Average < 0.7 |
| Faithfulness | Grounding score (RAG) | < 0.8 |
| Safety | Guardrail trigger rate | > 5% |
| Cost | Token usage per request | > 2x expected |
| Cost | Daily/weekly spend | > budget threshold |
| User experience | Thumbs up/down ratio | < 80% positive |
| Latency | Time to first token | > 2s |
| Latency | Total response time | P95 > 10s |
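The two cost rows can be tracked per request with a small accumulator. This is a sketch only: the per-1K-token prices are made-up placeholders rather than any provider's real rates, and `CostTracker` and its alert names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical USD prices per 1,000 tokens; substitute your provider's actual rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


@dataclass
class CostTracker:
    daily_budget_usd: float
    expected_tokens_per_request: int
    spent_usd: float = 0.0

    def record(self, input_tokens, output_tokens):
        """Add one request's cost and return (cost, alerts)."""
        cost = (
            input_tokens * PRICE_PER_1K["input"]
            + output_tokens * PRICE_PER_1K["output"]
        ) / 1000
        self.spent_usd += cost
        alerts = []
        if input_tokens + output_tokens > 2 * self.expected_tokens_per_request:
            alerts.append("tokens_over_2x_expected")
        if self.spent_usd > self.daily_budget_usd:
            alerts.append("daily_budget_exceeded")
        return cost, alerts


tracker = CostTracker(daily_budget_usd=0.01, expected_tokens_per_request=1000)
cost, alerts = tracker.record(input_tokens=500, output_tokens=200)
```

Resetting `spent_usd` on a daily schedule is left out of the sketch; a real tracker would also persist totals outside the process.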

Monitoring Architecture

Application
    │
    ├── Structured logs ──> Log aggregation (ELK, CloudWatch, Cloud Logging)
    │
    ├── Metrics ──────────> Metrics store (Prometheus, Cloud Monitoring)
    │                              │
    │                        ┌─────▼─────┐
    │                        │ Dashboards │
    │                        │ (Grafana)  │
    │                        └─────┬─────┘
    │                              │
    │                        ┌─────▼─────┐
    │                        │  Alerts   │
    │                        │ (PagerDuty)│
    │                        └───────────┘
    │
    ├── Traces ───────────> Tracing (Jaeger, Cloud Trace)
    │
    └── Samples ──────────> Quality eval (offline, async)
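The "Structured logs" and "Samples" branches can share one record: each request emits a JSON line, and a small random fraction is flagged for offline quality evaluation. The field names, the 5% default sample rate, and `build_log_record` itself are assumptions for illustration.

```python
import json
import random
import time


def build_log_record(request_id, model_version, prompt_version, latency_ms,
                     input_tokens, output_tokens, sample_rate=0.05, rng=random):
    """Return one JSON log line; `sample_for_eval` marks requests for async quality eval."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        # rng.random() is in [0, 1), so sample_rate=1.0 samples every request.
        "sample_for_eval": rng.random() < sample_rate,
    }
    return json.dumps(record, sort_keys=True)


line = build_log_record("req-001", "fraud-detector:3", "summarize:v2", 412, 500, 200)
```

Logging the model and prompt versions on every request is what makes it possible to attribute a quality or cost change to a specific deployment later.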

MLOps Maturity Levels

Most teams start at Level 0 and that is fine -- the goal is to know where you are and what to invest in next. Jumping straight to Level 4 wastes effort; iterate incrementally toward your target maturity.

| Level | Description | Practices |
|---|---|---|
| 0 | Manual | Manual training, manual deployment, no monitoring |
| 1 | ML Pipeline | Automated training pipeline, manual deployment |
| 2 | CI/CD for ML | Automated training + deployment, basic monitoring |
| 3 | Full MLOps | Auto-retraining on drift, A/B testing, full observability |
| 4 | GenAI MLOps | + Prompt versioning, eval pipelines, safety monitoring, cost tracking |

GenAI-Specific CI/CD

GenAI pipelines need a new type of quality gate: prompt evaluation. Every prompt change should trigger automated format checks, relevance scoring, and safety tests before reaching production.

Prompt Evaluation Pipeline

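# Run as CI step on prompt changes.
# call_llm, check_format, llm_judge_relevance, safety_check, and semantic_similarity
# are placeholders for project-specific helpers that return scores in [0, 1].
def evaluate_prompt_change(prompt_config, eval_dataset):
    results = []
    for example in eval_dataset:
        response = call_llm(prompt_config, example["input"])
        scores = {
            "format_valid": check_format(response, prompt_config.format_spec),
            "relevance": llm_judge_relevance(example["input"], response),
            "safety": safety_check(response),
            "matches_reference": semantic_similarity(response, example["expected"])
        }
        results.append(scores)

    if not results:
        raise ValueError("Evaluation dataset is empty")

    avg_scores = {k: sum(r[k] for r in results) / len(results) for k in results[0]}

    # Quality gate. Raise explicitly: assert statements are skipped under `python -O`.
    gates = {
        "format_valid": (0.95, "Format compliance too low"),
        "relevance": (0.80, "Relevance below threshold"),
        "safety": (0.99, "Safety check failures"),
    }
    for metric, (threshold, message) in gates.items():
        if avg_scores[metric] <= threshold:
            raise RuntimeError(f"{message}: {avg_scores[metric]:.3f}")

    return avg_scores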

Common Pitfalls

The most expensive MLOps mistakes are the ones you discover in production -- no rollback plan, no drift monitoring, no cost tracking. Invest in these foundations before they become emergencies.

| Pitfall | Problem | Fix |
|---|---|---|
| No model versioning | Can't rollback | Use model registry from day one |
| Manual deployments | Slow, error-prone | Automate with CI/CD |
| No data validation | Bad data trains bad models | Add Great Expectations or Pydantic checks |
| No drift monitoring | Silent quality degradation | PSI/KS checks on schedule |
| No cost tracking (GenAI) | Surprise bills | Per-request token tracking + budget alerts |
| Testing only accuracy | Misses fairness, safety | Include bias + safety in eval |
| No prompt versioning (GenAI) | Can't reproduce or rollback | Version prompts in git with eval sets |
| Evaluating once | Quality changes over time | Continuous eval in production |
| No rollback plan | Stuck with bad deployment | Pre-define rollback triggers and process |
| Overcomplicating early | Slow progress | Start at Level 1, iterate to Level 3+ |