MLOps for GenAI | LegacyForward.ai

MLOps vs Traditional DevOps

MLOps extends DevOps with ML-specific concerns: model versioning, data validation, and drift detection. GenAI adds yet another layer -- prompt versioning, output quality monitoring, and cost tracking. Understanding these differences prevents you from applying the wrong operational playbook.

Aspect	DevOps	MLOps	GenAI Ops
Artifact	Code	Code + Model + Data	Code + Prompts + Configs
Testing	Unit/integration	+ Data validation, model eval	+ Prompt eval, safety tests
Versioning	Code (git)	+ Model registry, data versions	+ Prompt versions, eval sets
Monitoring	Uptime, latency	+ Model performance, drift	+ Output quality, cost, safety
Rollback	Deploy previous code	+ Rollback model version	+ Rollback prompt version
CI trigger	Code commit	+ Data change, schedule	+ Prompt change, eval regression

CI/CD Pipeline for ML

Manual model training and deployment is one of the most common bottlenecks for ML teams. A CI/CD pipeline automates the path from code commit to production deployment, with quality gates that prevent bad models from reaching users.

┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
│   Code     │    │   Build &  │    │   Model    │    │   Deploy   │
│   Commit   │───>│   Test     │───>│   Train    │───>│   & Serve  │
│            │    │            │    │   & Eval   │    │            │
└────────────┘    └────────────┘    └────────────┘    └────────────┘
     │                 │                  │                  │
     ▼                 ▼                  ▼                  ▼
  Git push       - Lint code        - Train model      - Canary deploy
  PR review      - Unit tests       - Evaluate          - A/B test
                 - Data validation  - Compare baseline  - Monitor
                 - Schema check     - Bias audit        - Auto-rollback

CI Pipeline Steps

Step	Tool	Purpose
Code lint	ruff, black, mypy	Code quality
Unit tests	pytest	Logic correctness
Data validation	Great Expectations, Pydantic	Schema, distributions, nulls
Training	Vertex AI, SageMaker, custom	Model fitting
Model evaluation	Custom metrics + framework	Quality gate
Bias check	Fairlearn, Aequitas	Fairness gate
Security scan	Trivy, Snyk	Container/dependency vulnerabilities
Prompt eval (GenAI)	Promptfoo, custom	Prompt quality gate
Integration test	Custom	End-to-end pipeline validation

Example GitHub Actions ML Pipeline

name: ML Pipeline
on:
  push:
    branches: [main]
    paths: ['model/**', 'data/**', 'prompts/**']

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: pytest tests/ -v
      - name: Validate data schema
        run: python scripts/validate_data.py
      - name: Run prompt evaluations
        run: python scripts/eval_prompts.py --threshold 0.85

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # each job starts on a fresh runner, so check out again
      - name: Train model
        run: python scripts/train.py --config configs/prod.yaml
      - name: Evaluate model
        run: python scripts/evaluate.py --min-accuracy 0.92
      - name: Register model
        run: python scripts/register_model.py --tag candidate

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # each job starts on a fresh runner, so check out again
      - name: Canary deploy (10%)
        run: python scripts/deploy.py --traffic 10 --tag candidate
      - name: Run smoke tests
        run: python scripts/smoke_test.py
      - name: Promote to full traffic
        run: python scripts/deploy.py --traffic 100 --tag candidate
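
The "Evaluate model" step only acts as a quality gate if the script exits with a non-zero code when the threshold is missed, because that is what makes the CI step (and every job that depends on it) fail. Below is a minimal sketch of that contract. The `metrics.json` file, its `accuracy` key, and the `--metrics` flag are assumptions made for illustration, not part of the pipeline above.

```python
import argparse
import json
import sys
from pathlib import Path


def check_gate(metrics: dict, min_accuracy: float) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    accuracy = metrics.get("accuracy")
    if accuracy is None:
        failures.append("metrics file has no 'accuracy' entry")
    elif accuracy < min_accuracy:
        failures.append(f"accuracy {accuracy:.3f} < required {min_accuracy:.3f}")
    return failures


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Model quality gate (sketch)")
    parser.add_argument("--min-accuracy", type=float, required=True)
    # Hypothetical location of the metrics written by the training step.
    parser.add_argument("--metrics", type=Path, default=Path("metrics.json"))
    args = parser.parse_args(argv)

    metrics = json.loads(args.metrics.read_text())
    failures = check_gate(metrics, args.min_accuracy)
    for failure in failures:
        print(f"GATE FAILED: {failure}", file=sys.stderr)
    return 1 if failures else 0  # a non-zero exit code fails the CI step
```

A real script would end with `sys.exit(main(sys.argv[1:]))` so the return value becomes the process exit code.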

Model Versioning

If you cannot reproduce a model from six months ago, you cannot debug it, audit it, or roll back to it. Version everything -- code, data, weights, hyperparameters, and prompts.

What to Version

Artifact	Tool	Why
Code	Git	Reproducibility
Training data	DVC, Delta Lake, GCS versioned buckets	Data lineage
Model weights	MLflow, Vertex AI Model Registry, W&B	Rollback, comparison
Hyperparameters	MLflow, config files in git	Reproducibility
Evaluation results	MLflow, database	Quality tracking
Prompts (GenAI)	Git, prompt management tool	Version + eval tracking
Eval datasets	Git, DVC	Consistent benchmarking
Environment	Docker image tags	Reproducibility
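
One lightweight way to tie these artifacts together is a per-run manifest that stores one fingerprint per item. The sketch below assumes you already know the git commit and Docker image tag. The function and field names are illustrative, and a registry such as MLflow can hold the same information as run tags.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Content fingerprint: identical bytes always give the same digest."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(git_commit, data_bytes, hyperparams, prompt_text, image_tag):
    """Collect one fingerprint per versioned artifact for a training run."""
    return {
        "git_commit": git_commit,
        "data_sha256": sha256_hex(data_bytes),
        # sort_keys makes the hash independent of dict key order
        "hyperparams_sha256": sha256_hex(
            json.dumps(hyperparams, sort_keys=True).encode()
        ),
        "prompt_sha256": sha256_hex(prompt_text.encode()),
        "image_tag": image_tag,
    }


manifest = build_manifest(
    git_commit="abc1234",                     # placeholder commit id
    data_bytes=b"id,amount,label\n1,9.5,0\n",  # placeholder training data
    hyperparams={"lr": 0.01, "epochs": 10},
    prompt_text="You are a concise summarizer.",
    image_tag="trainer:2024-01-01",           # placeholder image tag
)
print(json.dumps(manifest, indent=2))
```

If two runs produce different models, comparing their manifests shows which input changed.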

MLflow Model Registry

import mlflow

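# Log model during training
# (`model` is assumed to be a fitted scikit-learn estimator from your training code)
with mlflow.start_run():
    mlflow.log_params({"lr": 0.01, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.94, "f1": 0.91})
    mlflow.sklearn.log_model(model, "model", registered_model_name="fraud-detector")

# Promote model version
# Note: registry stages are deprecated in recent MLflow releases (2.9+), where
# aliases set via client.set_registered_model_alias(name, alias, version) are
# the suggested replacement. Check the docs for the version you run.
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production"
)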

Prompt Versioning (GenAI-Specific)

In GenAI systems, prompts are code -- a one-word change can dramatically alter output quality. Version prompts in git alongside their evaluation datasets so every change is tracked, testable, and reversible.

Prompt Management Pattern

prompts/
  ├── summarize/
  │   ├── v1.yaml
  │   ├── v2.yaml         # current production
  │   ├── v3.yaml         # candidate
  │   └── eval_set.jsonl  # evaluation dataset
  ├── classify/
  │   ├── v1.yaml
  │   └── eval_set.jsonl
  └── config.yaml          # which version is active per env
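
With this layout, promoting or rolling back a prompt is a one-line change to `config.yaml`. The sketch below shows how application code might resolve the active file. The mapping stands in for the parsed contents of `config.yaml`, whose real structure is not specified here, so the `env -> prompt -> version` shape is an assumption.

```python
from pathlib import PurePosixPath

# Stand-in for the parsed prompts/config.yaml (assumed structure).
ACTIVE_VERSIONS = {
    "production": {"summarize": 2, "classify": 1},
    "staging": {"summarize": 3, "classify": 1},
}


def prompt_path(env: str, name: str, config=ACTIVE_VERSIONS, root="prompts"):
    """Return the path of the prompt file that is active in the given environment."""
    try:
        version = config[env][name]
    except KeyError as exc:
        raise KeyError(f"no active version for prompt {name!r} in env {env!r}") from exc
    return PurePosixPath(root) / name / f"v{version}.yaml"


print(prompt_path("production", "summarize"))
print(prompt_path("staging", "summarize"))
```

Because the version pointer lives in git, a rollback is an ordinary revert that goes through the same review and eval pipeline as any other change.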

Prompt Config Example

# prompts/summarize/v2.yaml
name: summarize
version: 2
model: claude-sonnet-4
temperature: 0.3
max_tokens: 500
system: |
  You are a concise summarizer. Extract key points from the given text.
  Return exactly 3 bullet points, each under 20 words.
template: |
  Summarize the following text:

  {text}
eval_criteria:
  - name: bullet_count
    type: format_check
    expected: 3
  - name: relevance
    type: llm_judge
    threshold: 0.8
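
A `format_check` criterion such as `bullet_count` can be evaluated without calling a model at all. The following sketch renders the template and counts bullet lines in a response. The helper names and the set of accepted bullet markers are assumptions, and the `llm_judge` criterion is out of scope here.

```python
import re


def render(template: str, **variables) -> str:
    """Fill {placeholders} in the template.

    str.format treats every brace as a placeholder, so templates containing
    literal braces (for example JSON snippets) would need {{ and }} escapes.
    """
    return template.format(**variables)


def check_bullet_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` lines that start with a bullet."""
    bullets = [
        line for line in response.splitlines()
        if re.match(r"^\s*[-*•]\s+\S", line)
    ]
    return len(bullets) == expected


prompt = render("Summarize the following text:\n\n{text}", text="Quarterly report ...")
sample_response = "- Revenue grew\n- Costs fell\n- Hiring paused"
print(check_bullet_count(sample_response, expected=3))
```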

A/B Testing

A/B testing is the most reliable way to measure whether a new model actually improves business outcomes in production. Without it, you are deploying based on offline metrics that may not correlate with real-world performance.

A/B Test Architecture

              Load Balancer
              (traffic split)
             /              \
       90% traffic      10% traffic
            │                │
     ┌──────▼──────┐  ┌─────▼───────┐
     │  Model A    │  │  Model B    │
     │ (champion)  │  │ (candidate) │
     └──────┬──────┘  └──────┬──────┘
            │                │
            └──────┬─────────┘
                   ▼
            Metrics Collection
            & Statistical Test
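
The split in the diagram should be sticky: a given user must keep hitting the same model for the whole experiment, or the metrics for the two arms get mixed. A common way to get that without storing assignments is to hash a stable identifier into buckets. This is a minimal sketch with illustrative function and parameter names.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, candidate_pct: float = 10.0) -> str:
    """Deterministically map a user to 'champion' or 'candidate'.

    Including the experiment name in the hash input keeps assignments
    independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "candidate" if bucket < candidate_pct * 100 else "champion"


# The same user always lands in the same arm.
print(assign_variant("user-42", "summarizer-v3"))
print(assign_variant("user-42", "summarizer-v3"))
```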

A/B Test Checklist

Step	Action
1. Define hypothesis	"Model B improves metric X by Y%"
2. Choose metrics	Primary (business), secondary (model quality), guardrail (safety)
3. Calculate sample size	Power analysis: significance level, power, minimum detectable effect
4. Randomize traffic	Consistent hashing by user/session
5. Run experiment	Collect data for calculated duration
6. Analyze results	Statistical significance test
7. Decide	Promote, iterate, or discard
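
Step 3 can be made concrete for a conversion-rate metric. The sketch below uses the standard normal-approximation formula for comparing two proportions and assumes equal-sized arms. With an uneven split such as 90/10, the smaller arm is the one that has to reach this size. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.80):
    """Approximate users needed in EACH arm for a two-sided two-proportion test.

    min_detectable_lift is relative: 0.10 means detecting a change from
    baseline_rate to baseline_rate * 1.10.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_power = NormalDist().inv_cdf(power)          # statistical power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# A 10% baseline conversion rate and a 10% relative lift need
# on the order of fifteen thousand users per arm.
print(sample_size_per_arm(0.10, 0.10))
```

Dividing the required size by the daily traffic of the smaller arm gives the experiment duration for step 5.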

Statistical Significance

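from scipy import stats

# Two-proportion z-test (for conversion rates)
def ab_test_proportions(successes_a, total_a, successes_b, total_b, alpha=0.05):
    """Two-sided pooled z-test. A is the champion, B is the candidate."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    p_pool = (successes_a + successes_b) / (total_a + total_b)

    # Pooled standard error under the null hypothesis that both rates are equal
    se = (p_pool * (1 - p_pool) * (1/total_a + 1/total_b)) ** 0.5
    if se == 0:
        raise ValueError("Both rates are 0% or 100%; the z-test is undefined")
    z = (p_b - p_a) / se
    # sf(x) equals 1 - cdf(x) but keeps precision for very small p-values
    p_value = 2 * stats.norm.sf(abs(z))

    return {
        "rate_a": p_a, "rate_b": p_b,
        "lift": (p_b - p_a) / p_a if p_a > 0 else float("inf"),
        "z_score": z, "p_value": p_value,
        "significant": p_value < alpha
    }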

Canary Deployment

Canary deployments let you validate a new model on a small percentage of live traffic before rolling it out fully. Combined with auto-rollback triggers, they prevent bad deployments from impacting more than a fraction of users.

Canary Stages

Stage	Traffic %	Duration	Gate
Deploy	0% (shadow)	1 hour	Logs clean, no errors
Canary	5%	2-4 hours	Error rate < baseline + 0.5%
Expand	25%	4-12 hours	Latency P95 < baseline + 10%
Expand	50%	12-24 hours	Business metrics stable
Full	100%	Ongoing	All metrics healthy
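
The stage table translates naturally into data plus a small promotion rule. In this sketch the minimum durations are the lower bounds from the table, and a failed gate is treated as a rollback to 0% traffic, which is an assumption. Your process might hold at the current stage instead. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int
    min_hours: float


STAGES = [
    Stage("shadow", 0, 1),
    Stage("canary", 5, 2),
    Stage("expand-25", 25, 4),
    Stage("expand-50", 50, 12),
    Stage("full", 100, 0),
]


def next_traffic_pct(stage_index: int, hours_in_stage: float, gate_passed: bool) -> int:
    """Return the traffic share to apply next: roll back, hold, or advance."""
    if not gate_passed:
        return 0  # assumption: any failed gate sends all traffic back to the old model
    stage = STAGES[stage_index]
    is_last = stage_index == len(STAGES) - 1
    if is_last or hours_in_stage < stage.min_hours:
        return stage.traffic_pct  # hold at the current stage
    return STAGES[stage_index + 1].traffic_pct  # advance one stage


print(next_traffic_pct(stage_index=1, hours_in_stage=3, gate_passed=True))
```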

Auto-Rollback Triggers

Signal	Threshold	Action
Error rate	> 5%	Immediate rollback
Latency P95	> 2x baseline	Rollback
Guardrail trigger rate	> 3x baseline	Rollback
User feedback negative	> 2x baseline	Pause, investigate
Cost per request	> 2x baseline	Pause, investigate
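
These triggers are simple enough to encode as one function that a deployment controller calls on every metrics window. The sketch below mirrors the table row by row. The metric key names are hypothetical and would map to whatever your metrics store exposes.

```python
def rollback_decision(current: dict, baseline: dict) -> str:
    """Return 'rollback', 'pause', or 'continue' for one metrics window."""
    # Hard triggers: roll back automatically.
    if current["error_rate"] > 0.05:
        return "rollback"
    if current["latency_p95_ms"] > 2 * baseline["latency_p95_ms"]:
        return "rollback"
    if current["guardrail_rate"] > 3 * baseline["guardrail_rate"]:
        return "rollback"
    # Soft triggers: stop the rollout and let a human investigate.
    if current["negative_feedback_rate"] > 2 * baseline["negative_feedback_rate"]:
        return "pause"
    if current["cost_per_request"] > 2 * baseline["cost_per_request"]:
        return "pause"
    return "continue"


baseline = {
    "error_rate": 0.01, "latency_p95_ms": 800, "guardrail_rate": 0.01,
    "negative_feedback_rate": 0.05, "cost_per_request": 0.002,
}
slow_canary = dict(baseline, latency_p95_ms=1700)
print(rollback_decision(slow_canary, baseline))
```

Hard triggers are checked first so that a deployment that is both expensive and failing rolls back rather than pauses.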

Drift Detection

The real world changes constantly -- customer behavior shifts, data pipelines break, and upstream schemas evolve. Drift detection alerts you when your model's operating conditions have diverged from what it was trained on.

Types of Drift

Type	What Changes	Detection Method
Data drift	Input feature distributions	PSI, KL divergence, KS test
Concept drift	Relationship between features and target	Model performance metrics
Prediction drift	Output distribution	PSI on predictions
Label drift	Ground truth distribution	Monitor label feedback
Upstream drift	Data pipeline changes	Schema validation

Population Stability Index (PSI)

import numpy as np

def calculate_psi(expected, actual, bins=10):
    """
    PSI < 0.1: No significant change
    PSI 0.1-0.25: Moderate change, investigate
    PSI > 0.25: Significant change, action needed
    """
    breakpoints = np.quantile(expected, np.linspace(0, 1, bins + 1))
    breakpoints[0] = -np.inf
    breakpoints[-1] = np.inf

    expected_counts = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    actual_counts = np.histogram(actual, bins=breakpoints)[0] / len(actual)

    # Avoid division by zero
    expected_counts = np.clip(expected_counts, 0.001, None)
    actual_counts = np.clip(actual_counts, 0.001, None)

    psi = np.sum((actual_counts - expected_counts) * np.log(actual_counts / expected_counts))
    return psi
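
The function above bins a numeric feature by quantiles. For categorical values, such as predicted class labels when tracking prediction drift, the categories themselves can serve as bins. The following is a dependency-free sketch of that variant, using the same 0.001 floor to avoid taking the log of zero. The function name is illustrative.

```python
import math
from collections import Counter


def categorical_psi(expected, actual, floor=0.001):
    """PSI where each distinct category is its own bin."""
    categories = set(expected) | set(actual)
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in categories:
        # Counter returns 0 for unseen categories; the floor keeps log() finite.
        e = max(expected_counts[category] / len(expected), floor)
        a = max(actual_counts[category] / len(actual), floor)
        psi += (a - e) * math.log(a / e)
    return psi


training_preds = ["ok"] * 90 + ["fraud"] * 10
production_preds = ["ok"] * 50 + ["fraud"] * 50
print(f"PSI = {categorical_psi(training_preds, production_preds):.3f}")
```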

KL Divergence

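import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

def kl_divergence(p, q):
    """
    KL(P || Q): How much P differs from Q.
    Not symmetric: KL(P||Q) != KL(Q||P)
    p and q are probability vectors over the same bins (each sums to 1).
    The result is inf if q is 0 in a bin where p is not.
    """
    return float(np.sum(rel_entr(p, q)))

# Jensen-Shannon (symmetric alternative).
# Note: scipy's jensenshannon returns the JS *distance*, which is the square
# root of the JS divergence. Square it if you need the divergence itself.
# distribution_train / distribution_prod are binned probability vectors.
js_distance = jensenshannon(distribution_train, distribution_prod)
js_divergence = js_distance ** 2
# Rule of thumb for the returned distance: < 0.1 similar, > 0.3 investigate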

Kolmogorov-Smirnov Test

from scipy.stats import ks_2samp

stat, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.05:
    print(f"Significant drift detected (KS stat: {stat:.3f}, p: {p_value:.4f})")

Monitoring Metrics

Monitoring is your early warning system for every type of production failure -- performance degradation, cost spikes, safety violations, and user dissatisfaction. GenAI systems require additional metrics beyond traditional ML.

Traditional ML Monitoring

Category	Metric	Alert Condition
Performance	Accuracy, F1, AUC	Drop > 5% from baseline
Latency	P50, P95, P99	P95 > SLA threshold
Throughput	Requests/sec	> 80% capacity
Error rate	Failed predictions	> 1%
Data quality	% null, schema violations	Any violation
Feature drift	PSI per feature	PSI > 0.25
Prediction drift	PSI on outputs	PSI > 0.2
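
As an example of turning one row of this table into a check, the sketch below computes latency percentiles with the nearest-rank method and flags a P95 breach. The function names and the SLA value are illustrative. In production these numbers usually come from the metrics store rather than raw samples.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_report(latencies_ms, sla_p95_ms):
    """Summarize P50/P95/P99 and flag when P95 exceeds the SLA threshold."""
    p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
    return {"p50": p50, "p95": p95, "p99": p99, "alert": p95 > sla_p95_ms}


samples = [120, 135, 150, 160, 180, 210, 250, 400, 900, 1500]
print(latency_report(samples, sla_p95_ms=1000))
```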

GenAI-Specific Monitoring

Category	Metric	Alert Condition
Quality	LLM-as-judge score (sampled)	Average < 0.7
Faithfulness	Grounding score (RAG)	< 0.8
Safety	Guardrail trigger rate	> 5%
Cost	Token usage per request	> 2x expected
Cost	Daily/weekly spend	> budget threshold
User experience	Thumbs up/down ratio	< 80% positive
Latency	Time to first token	> 2s
Latency	Total response time	P95 > 10s
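
The two cost rows can be implemented from the token counts that most LLM APIs return with each response. The prices in this sketch are placeholders, not any provider's actual rates, and the function names are illustrative.

```python
# Placeholder prices in USD per million tokens. Substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def request_cost(input_tokens: int, output_tokens: int, prices=PRICE_PER_MTOK) -> float:
    """Cost of one request in USD."""
    return (
        input_tokens * prices["input"] + output_tokens * prices["output"]
    ) / 1_000_000


def cost_alerts(tokens_used, expected_tokens, spend_today_usd, daily_budget_usd):
    """Return the alert conditions from the table that currently hold."""
    alerts = []
    if tokens_used > 2 * expected_tokens:
        alerts.append("token usage above 2x expected")
    if spend_today_usd > daily_budget_usd:
        alerts.append("daily spend over budget")
    return alerts


print(request_cost(input_tokens=1000, output_tokens=500))
print(cost_alerts(tokens_used=4200, expected_tokens=1500,
                  spend_today_usd=80.0, daily_budget_usd=100.0))
```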

Monitoring Architecture

Application
    │
    ├── Structured logs ──> Log aggregation (ELK, CloudWatch, Cloud Logging)
    │
    ├── Metrics ──────────> Metrics store (Prometheus, Cloud Monitoring)
    │                              │
    │                        ┌─────▼─────┐
    │                        │ Dashboards │
    │                        │ (Grafana)  │
    │                        └─────┬─────┘
    │                              │
    │                        ┌─────▼─────┐
    │                        │  Alerts   │
    │                        │ (PagerDuty)│
    │                        └───────────┘
    │
    ├── Traces ───────────> Tracing (Jaeger, Cloud Trace)
    │
    └── Samples ──────────> Quality eval (offline, async)
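
The "Structured logs" and "Samples" branches can share one record per request: a JSON line for the log aggregator, with a flag marking the small fraction of requests that should also go to offline quality evaluation. In this sketch the field names and the 5% sample rate are assumptions.

```python
import hashlib
import json
import time


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same request id always gets the same answer."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_rate * 10_000


def log_record(request_id, prompt_version, latency_ms,
               input_tokens, output_tokens, sample_rate=0.05) -> str:
    """Build one JSON log line for a completed LLM request."""
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "prompt_version": prompt_version,
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "sampled_for_eval": should_sample(request_id, sample_rate),
    })


print(log_record("req-001", "summarize/v2", 840, 1000, 120))
```

Logging the prompt version with every request is what lets dashboards attribute a quality or cost change to a specific prompt release.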

MLOps Maturity Levels

Most teams start at Level 0 and that is fine -- the goal is to know where you are and what to invest in next. Jumping straight to Level 4 wastes effort; iterate incrementally toward your target maturity.

Level	Description	Practices
0	Manual	Manual training, manual deployment, no monitoring
1	ML Pipeline	Automated training pipeline, manual deployment
2	CI/CD for ML	Automated training + deployment, basic monitoring
3	Full MLOps	Auto-retraining on drift, A/B testing, full observability
4	GenAI MLOps	+ Prompt versioning, eval pipelines, safety monitoring, cost tracking

GenAI-Specific CI/CD

GenAI pipelines need a new type of quality gate: prompt evaluation. Every prompt change should trigger automated format checks, relevance scoring, and safety tests before reaching production.

Prompt Evaluation Pipeline

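# Run as CI step on prompt changes.
# call_llm, check_format, llm_judge_relevance, safety_check, and
# semantic_similarity are project-specific helpers that are not defined here.
# Each scorer is expected to return a bool or a score in [0, 1].
def evaluate_prompt_change(prompt_config, eval_dataset):
    if not eval_dataset:
        raise ValueError("eval_dataset is empty; nothing to evaluate")
    results = []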
    for example in eval_dataset:
        response = call_llm(prompt_config, example["input"])
        scores = {
            "format_valid": check_format(response, prompt_config.format_spec),
            "relevance": llm_judge_relevance(example["input"], response),
            "safety": safety_check(response),
            "matches_reference": semantic_similarity(response, example["expected"])
        }
        results.append(scores)

    avg_scores = {k: sum(r[k] for r in results) / len(results) for k in results[0]}

    # Quality gate
    assert avg_scores["format_valid"] > 0.95, "Format compliance too low"
    assert avg_scores["relevance"] > 0.80, "Relevance below threshold"
    assert avg_scores["safety"] > 0.99, "Safety check failures"

    return avg_scores
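
Absolute thresholds catch bad prompts, but not a prompt that is merely worse than the one it replaces. A complementary gate compares the candidate's average scores against those of the current production version. The tolerance value and function name in this sketch are assumptions.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return metrics where the candidate scores worse than baseline minus tolerance.

    A metric missing from the candidate counts as a regression.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or new_score < base_score - tolerance:
            regressions[metric] = (base_score, new_score)
    return regressions


production_scores = {"format_valid": 0.98, "relevance": 0.92, "safety": 1.0}
candidate_scores = {"format_valid": 0.99, "relevance": 0.85, "safety": 1.0}
print(find_regressions(production_scores, candidate_scores))
```

The tolerance absorbs run-to-run noise from LLM-judged metrics. Without it, small random fluctuations would block most prompt changes.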

Common Pitfalls

The most expensive MLOps mistakes are the ones you discover in production -- no rollback plan, no drift monitoring, no cost tracking. Invest in these foundations before they become emergencies.

Pitfall	Problem	Fix
No model versioning	Can't rollback	Use model registry from day one
Manual deployments	Slow, error-prone	Automate with CI/CD
No data validation	Bad data trains bad models	Add Great Expectations or Pydantic checks
No drift monitoring	Silent quality degradation	PSI/KS checks on schedule
No cost tracking (GenAI)	Surprise bills	Per-request token tracking + budget alerts
Testing only accuracy	Misses fairness, safety	Include bias + safety in eval
No prompt versioning (GenAI)	Can't reproduce or rollback	Version prompts in git with eval sets
Evaluating once	Quality changes over time	Continuous eval in production
No rollback plan	Stuck with bad deployment	Pre-define rollback triggers and process
Overcomplicating early	Slow progress	Start at Level 1, iterate to Level 3+