Module 10 of 17

Evaluation Strategies for LLM Systems

A practitioner's guide to evaluating LLM applications — what to measure, LLM-as-judge patterns, RAGAS for RAG pipelines, regression testing, A/B testing, evaluation pipeline design, and when to invest in human eval. Decision frameworks for building eval infrastructure that earns trust.


01. Why Evaluation Is Hard -- and Why It Matters

In traditional ML, evaluation is conceptually simple: you have a test set with known labels, your model makes predictions, and you count how often it is right. LLMs break every part of this approach. When you ask an LLM to "summarize this document" or "explain quantum entanglement to a ten-year-old," there is no single correct answer. There are dozens of equally valid answers varying in length, style, focus, and vocabulary. Even two expert humans can disagree about whether a given response is excellent or merely adequate.

The practical consequence is that many teams fall back on the "vibe check": a developer tries the system a few times, it seems to work, they ship it. The vibe check does not scale -- not to the volume of production outputs, not to the diversity of real user inputs, not to detecting subtle regressions when you change a prompt or swap a model. The vibe check is how teams ship changes that hurt 20% of their users while improving the 5 examples the developer happened to test.

Compounding this is prompt sensitivity: adding a single word to a system prompt, changing temperature by 0.1, or switching model versions can produce noticeably different outputs. Every change is potentially a regression you cannot see without a rigorous eval harness. Evaluation is not a one-time activity -- it is continuous measurement infrastructure.

Think of it like this: Evaluating an LLM application without an eval harness is like driving without a speedometer. You might feel like you are going the right speed, but you have no way to know if a small change to the engine made you faster or slower. The eval harness is the dashboard that makes every decision data-driven.

What This Means for Practitioners

The evaluation cost-accuracy tradeoff drives every decision:

Method | Cost per Sample | Speed | Accuracy | Use For
Automated metrics (BLEU, ROUGE, BERTScore) | <$0.001 | Instant | Low-Medium | Every CI run, format/length checks
LLM-as-judge | $0.01-0.10 | Seconds | Medium-High | Detailed quality on sampled traffic
Human evaluation | $5-50 | Hours-Days | Highest | Periodic calibration, new domains

Build your eval stack in layers: cheap automated metrics for every run, LLM-as-judge for detailed analysis, human eval for periodic calibration. No single layer is sufficient alone.
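As a rough illustration of the cheapest layer, the sketch below runs deterministic format and length checks on a single output. The function name `cheap_checks`, the 200-word limit, and the optional JSON check are assumptions made for this example, not part of any library.

```python
import json


def cheap_checks(output: str, max_words: int = 200,
                 require_json: bool = False) -> dict:
    """Run deterministic, near-zero-cost checks on one model output.

    The word limit and the JSON requirement are illustrative defaults.
    """
    checks = {
        "non_empty": bool(output.strip()),
        "within_length": len(output.split()) <= max_words,
    }
    if require_json:
        try:
            json.loads(output)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False
    # The overall result is computed before "passed" is added to the dict.
    checks["passed"] = all(checks.values())
    return checks
```

Checks like these are cheap enough to run on every CI build; outputs that fail them never need to reach the more expensive judge layer.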

What to measure depends on your application type:

Application Type | Primary Metrics | Secondary Metrics
RAG / knowledge Q&A | Faithfulness, context precision, answer relevance | Retrieval recall, citation accuracy
Chatbot / assistant | Helpfulness, tone, safety | Session completion rate, escalation rate
Code generation | pass@1 (unit tests pass), correctness | Code quality, security, efficiency
Summarization | Faithfulness, completeness, conciseness | ROUGE-L (as sanity check only)
Classification / extraction | Accuracy, F1, exact match | Latency, cost per classification

02. LLM-as-Judge

The LLM-as-judge pattern uses a powerful model (GPT-4o, Claude Opus) to evaluate the output of another model. This might sound circular, but evaluating a response is a much easier task than generating it -- just as a human editor can spot errors in an essay they could not have written themselves. GPT-4 as a judge achieves approximately 80% agreement with human raters on many benchmarks, comparable to agreement between two human annotators.

The practical value is evaluation at production scale. If your application handles 10,000 queries per day, you cannot have a human review each response. But you can run an automated judge on a representative sample of 500 queries and get a reliable quality signal every time you change a prompt.
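To get a feel for how reliable a 500-query sample is, the sketch below computes a normal-approximation 95% confidence interval for a judge pass rate. The function name and the example counts are hypothetical; the formula is the standard interval for a proportion and is only a rough guide when the pass rate is very close to 0 or 1.

```python
import math


def pass_rate_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a pass rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical run: 425 of 500 sampled responses pass the judge.
low, high = pass_rate_interval(425, 500)
print(f"pass rate 0.85, 95% interval ({low:.3f}, {high:.3f})")
```

With these numbers the margin is about three percentage points either way, so a drop from 85% to 80% is likely real while a drop to 84% is within noise.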

What This Means for Practitioners

Two evaluation modes, each with a purpose:

Mode | How It Works | Best For | Tradeoff
Pointwise | Judge scores a single response on a 1-5 scale against a rubric | Scalable quality monitoring | Requires a well-calibrated rubric
Pairwise | Judge picks the better of two responses | Model comparison, A/B analysis | More reliable, but comparing N candidates needs on the order of N-squared judgments

Known biases and how to mitigate them:

Bias | Effect | Mitigation
Position bias | Prefers whichever answer appears first (15-25% effect) | Run pairwise twice with swapped positions, average
Verbosity bias | Rates longer responses higher (10-20% inflation) | Rubric explicitly penalizes unnecessary length
Self-enhancement | Model rates its own style higher | Use a different model family as judge
Sycophancy | Prefers responses that agree with prompt claims | Neutral framing in judge prompt
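The swapped-position mitigation for position bias can be sketched as follows. The `judge` callable is a hypothetical stand-in for a real LLM call and is assumed to return "first" or "second". Instead of averaging scores, this variant treats a disagreement between the two passes as a tie, which is one simple way to combine them.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# returns "first" or "second". A real judge would call an LLM.
Judge = Callable[[str, str, str], str]


def debiased_pairwise(question: str, answer_a: str, answer_b: str,
                      judge: Judge) -> str:
    """Judge twice with positions swapped; return "A", "B", or "tie"."""
    first_pass = judge(question, answer_a, answer_b)   # A shown first
    second_pass = judge(question, answer_b, answer_a)  # B shown first
    # Count how many of the two passes preferred answer A.
    a_wins = int(first_pass == "first") + int(second_pass == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the passes disagree, so position likely drove the choice
```

A judge that always picks the first position yields "tie" on every pair under this scheme, which makes a strong position bias easy to spot in aggregate.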

Full LLM-as-judge implementation:

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()

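class JudgeVerdict(BaseModel):
    # Structured outputs are generated in schema order, so reasoning is
    # declared first to match the "reasoning BEFORE score" rule in the prompt.
    reasoning: str
    strengths: list[str] = Field(default_factory=list)
    weaknesses: list[str] = Field(default_factory=list)
    score: int = Field(..., ge=1, le=5)
    verdict: Literal["pass", "fail"]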

JUDGE_SYSTEM_PROMPT = """You are an expert evaluator of AI assistant responses.
Score quality on a 5-point scale.

RUBRIC:
5 - Excellent: Fully correct, well-organized, no errors
4 - Good: Mostly correct with minor gaps
3 - Adequate: Correct core answer but missing important nuance
2 - Poor: Partially correct with significant errors
1 - Very poor: Incorrect or fails to address the question

RULES:
- Score based on accuracy and helpfulness, NOT length
- Write reasoning BEFORE assigning the score
- Provide specific evidence from the response"""

def judge_pointwise(question: str, answer: str,
                    reference: str | None = None,
                    judge_model: str = "gpt-4o") -> JudgeVerdict:

    user_content = f"QUESTION: {question}\n\nANSWER TO EVALUATE:\n{answer}"
    if reference:
        user_content += f"\n\nREFERENCE (guidance, not required match):\n{reference}"

    response = client.beta.chat.completions.parse(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_content}
        ],
        response_format=JudgeVerdict,
        temperature=0.0
    )
    return response.choices[0].message.parsed

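def batch_evaluate(test_cases: list[dict]) -> dict:
    # Fail early: the summary below divides by the number of cases.
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    results = []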
    for case in test_cases:
        verdict = judge_pointwise(
            question=case["question"], answer=case["answer"],
            reference=case.get("reference")
        )
        results.append({**case, "verdict": verdict.model_dump()})

    scores = [r["verdict"]["score"] for r in results]
    pass_rate = sum(1 for r in results if r["verdict"]["verdict"] == "pass") / len(results)
    return {
        "results": results,
        "summary": {
            "mean_score": sum(scores) / len(scores),
            "pass_rate": pass_rate,
            "score_distribution": {i: scores.count(i) for i in range(1, 6)}
        }
    }

Always log the judge's full reasoning trace, not just the score. Reasoning traces reveal systematic patterns -- "the judge always penalizes responses that don't start with a direct answer" -- that you can address in your judge prompt.

03. RAGAS for RAG Evaluation

Diagram 1: RAGAS decomposes RAG evaluation into four targeted metrics, each diagnosing a different failure mode in the retrieval-generation pipeline.

RAG systems have a unique evaluation challenge: there are multiple independent points of failure. The retrieval step might find the wrong documents. The generation step might contradict the retrieved context. The response might answer a different question. A single quality score cannot diagnose which of these failures is occurring -- you need separate metrics for separate pipeline stages.

RAGAS (Retrieval Augmented Generation Assessment) defines four complementary metrics that give you a complete diagnostic picture. Teams that start using RAGAS consistently report it reveals problems they had no idea existed -- particularly faithfulness failures where the LLM confidently generates information not present in the retrieved context.

What This Means for Practitioners

The four RAGAS metrics and what each diagnoses:

Metric | What It Measures | How It Works | Failure It Catches
Faithfulness (0-1) | Are all claims supported by retrieved context? | Decompose answer into claims, check each against context | Hallucination -- the most dangerous RAG failure
Answer Relevance (0-1) | Does the answer address the question asked? | Generate questions that the answer would plausibly respond to, then compare them to the original question | Topic drift -- answering a different question
Context Precision (0-1) | Are retrieved chunks actually relevant? | Judge classifies each chunk as relevant or not | Noisy retrieval diluting context
Context Recall (0-1) | Did retrieval find everything needed? | Check if reference answer claims are covered by context | Missing documents or bad chunking

Interpreting RAGAS scores for action:

Score Pattern | Diagnosis | Fix
Low faithfulness, high context precision | LLM is hallucinating despite good retrieval | Strengthen synthesis prompt, lower temperature
High faithfulness, low answer relevance | Answer is grounded but off-topic | Improve query understanding or prompt focus
Low context precision | Retriever returning irrelevant chunks | Add reranking, tune chunk size, improve embeddings
Low context recall | Missing information in corpus | Expand document collection, fix chunking boundaries
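The score patterns in the table above can be turned into a small triage helper. This is a sketch only: the 0.7 cutoff for "low" is an arbitrary assumption to tune against your own data, and the metric keys are assumed to follow the RAGAS column names (`faithfulness`, `answer_relevancy`, `context_precision`, `context_recall`).

```python
def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Map low RAGAS scores to likely causes and fixes.

    `threshold` is an illustrative cutoff for "low", not a RAGAS default.
    """
    low = {name for name, value in scores.items() if value < threshold}
    findings = []
    if "faithfulness" in low and "context_precision" not in low:
        findings.append("Hallucination despite good retrieval: "
                        "strengthen synthesis prompt, lower temperature")
    if "answer_relevancy" in low and "faithfulness" not in low:
        findings.append("Grounded but off-topic: "
                        "improve query understanding or prompt focus")
    if "context_precision" in low:
        findings.append("Irrelevant chunks retrieved: "
                        "add reranking, tune chunk size, improve embeddings")
    if "context_recall" in low:
        findings.append("Needed information not retrieved: "
                        "expand document collection, fix chunking boundaries")
    return findings
```

Running this over mean scores per query category, instead of one global mean, tends to localize the failing pipeline stage faster.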

RAGAS implementation:

from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    Faithfulness, AnswerRelevancy, ContextPrecision, ContextRecall,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small"))

eval_samples = [
    {
        "user_input": "What is the refund policy for digital products?",
        "response": "Digital products are non-refundable once downloaded, "
                   "except in cases of technical failure verified by support.",
        "retrieved_contexts": [
            "Section 4.2: Digital goods are non-refundable after delivery.",
            "Section 4.5: Technical errors may qualify for refund via support.",
            "Section 1.1: Our store sells physical and digital products.",
        ],
        "reference": "Digital products are non-refundable after download. "
                  "Technical failures may qualify for refund via support."
    },
]

dataset = EvaluationDataset.from_list(eval_samples)
metrics = [
    Faithfulness(llm=llm),
    AnswerRelevancy(llm=llm, embeddings=embeddings),
    ContextPrecision(llm=llm),
    ContextRecall(llm=llm),
]

results = evaluate(dataset=dataset, metrics=metrics)
df = results.to_pandas()
print(df[["user_input", "faithfulness", "answer_relevancy",
          "context_precision", "context_recall"]])

Use RAGAS TestsetGenerator to bootstrap your eval set. It creates diverse question types (factual, multi-hop, abstractive) from your actual documents. A 50-100 question synthetic test set gets you started immediately, even before collecting real user queries.

04. Regression Testing & Evaluation-Driven Development

Diagram 2: The EDD loop runs continuously throughout the lifetime of an LLM application. Stages 4-5-6 repeat for every improvement iteration; stages 1-2-3 are revisited when the application scope changes.

Evaluation-Driven Development (EDD) is the LLM equivalent of Test-Driven Development. The core principle: define success criteria before you build, then build toward passing those criteria, and protect progress with automated regression tests on every change. Without an eval harness, every change requires manual spot-checking. With one, changes produce an objective score you can compare to the previous version.

Think of it like this: EDD is like having a grading rubric before writing an essay. You know what "A" looks like before you start writing, and you can check your work against the rubric at any point. Without the rubric, you are guessing whether your revisions made things better or worse.

What This Means for Practitioners

The EDD cycle in six stages:

Stage | Activity | Key Output
1. Define criteria | Write measurable success definitions for each capability | Rubric with pass/fail thresholds
2. Build eval dataset | 50-200 examples: golden cases + edge cases + adversarial | Test suite in version control
3. Implement baseline | Simplest reasonable implementation | Baseline scores (the floor)
4. Measure | Run full eval suite, inspect per-category scores | Failure analysis report
5. Improve | One targeted change at a time | Score delta per change
6. Measure again | Compare to previous, check for regressions | Go/no-go deployment decision

Critical rules for reliable evaluation:

  • One change at a time. Changing multiple things simultaneously makes it impossible to attribute score changes.
  • Separate development and test sets. Never evaluate only on examples you used to develop your system. Maintain a held-out test set you never inspect during iteration.
  • Track per-category scores, not just averages. A system scoring 80% overall but 40% on edge cases has a very different failure profile than 80% uniform.
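The per-category rule can be sketched as two small helpers: one that aggregates pass rates by category and one that flags categories that dropped relative to a baseline. The record fields (`category`, `passed`) and the 0.02 tolerance are assumptions made for this example.

```python
from collections import defaultdict


def per_category_pass_rate(results: list[dict]) -> dict[str, float]:
    """Aggregate pass rate per category from records with
    hypothetical "category" and "passed" fields."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        passes[record["category"]] += int(record["passed"])
    return {category: passes[category] / totals[category] for category in totals}


def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return the score delta for every category that dropped by more
    than `tolerance`. A category missing from the candidate counts as 0.0."""
    return {
        category: candidate.get(category, 0.0) - score
        for category, score in baseline.items()
        if candidate.get(category, 0.0) < score - tolerance
    }
```

A go/no-go gate can then be as simple as refusing to deploy whenever `find_regressions` returns a non-empty dict, even if the overall average improved.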

Building a golden dataset:

Start by collecting 100 real queries from production logs or beta users. Have subject-matter experts write reference answers. Label difficulty, topic, and required reasoning type. This dataset becomes your primary regression suite -- every model change, prompt change, or retrieval change must pass it before deployment.
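One possible shape for a golden example, stored as JSON Lines so the dataset diffs cleanly in version control, is sketched below. The class and field names are assumptions that mirror the labels mentioned above (difficulty, topic, reasoning type); adapt them to your own schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GoldenExample:
    """One golden-dataset record; field names are illustrative."""
    query: str
    reference_answer: str
    difficulty: str       # e.g. "easy", "hard"
    topic: str
    reasoning_type: str   # e.g. "lookup", "multi-hop"


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize one example per line, with stable key order for clean diffs."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def from_jsonl(text: str) -> list[GoldenExample]:
    """Parse JSON Lines back into examples, skipping blank lines."""
    return [GoldenExample(**json.loads(line))
            for line in text.splitlines() if line.strip()]
```

Keeping the labels as explicit fields makes it straightforward to slice regression results by difficulty or topic later.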

CI/CD integration: Modern LLM development treats eval runs like test suites. They run on every pull request, block merges if regressions are detected, and produce dashboards tracking trends. Platforms like LangSmith, Braintrust, and Weights & Biases provide this infrastructure.

05. A/B Testing in Production

Offline evaluation tells you whether a change is better on your test set. A/B testing tells you whether it is better for real users doing real tasks. The two are complementary: offline eval gates what gets deployed to the A/B test, and A/B results calibrate which offline metrics actually predict user satisfaction.

What This Means for Practitioners

When to use offline eval vs. A/B testing:

Scenario | Offline Eval | A/B Test
Prompt wording change | Yes (fast iteration) | Only if offline results are ambiguous
Model version swap | Yes (regression check) | Yes (measure real user impact)
New feature or capability | Yes (baseline measurement) | Yes (measure engagement, completion)
Retrieval parameter tuning | Yes (measure retrieval metrics) | Rarely needed
Major architecture change | Yes (comprehensive check) | Yes (mandatory before full rollout)

A/B testing requirements:

  • Serve two variants to random user segments (or split by user ID for consistency)
  • Log all interactions for both variants
  • Sample a subset for LLM-as-judge scoring or human review
  • Require a statistically significant difference before declaring a winner; in practice this usually means at least several hundred samples per variant per query category
  • Run long enough to cover edge cases and different user types
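Two of the requirements above lend themselves to a short sketch: consistent assignment by user ID and a significance check. The experiment name `prompt-v2` and the function names are hypothetical. The test shown is a standard two-proportion z-test on a binary outcome such as judge pass/fail, which is one reasonable choice among several.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "prompt-v2") -> str:
    """Deterministically map a user to "A" or "B" so that the same user
    always sees the same variant within one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # every outcome identical: no evidence of a difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who lands in B for one test is not systematically in B for the next.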

06. Human Evaluation

Every automated metric derives its authority from correlation with human judgment. This means that however sophisticated your automated eval pipeline becomes, you need periodic human evaluation to calibrate it and catch systematic failures that automated metrics miss.

What This Means for Practitioners

When to invest in human eval:

Trigger | Why
Major version release | Validate that automated metrics still correlate with quality
Automated metrics diverge from user satisfaction signals | Your metrics may be measuring the wrong thing
Entering a new domain or user segment | Existing rubrics may not apply
After red-teaming discovers new failure modes | Add discovered failures to regression suite

Annotation rubric design rules:

  • Decompose into independent dimensions (accuracy, completeness, tone, safety). Single holistic ratings are ambiguous and produce noisy data.
  • Include anchor examples for each rating level. Abstract descriptions of "3 out of 5" are ambiguous; concrete examples anchor the interpretation.
  • Calibrate across annotators before the main run. Have all annotators rate the same 20-30 examples, compare, and resolve disagreements.
  • Report inter-annotator agreement (Cohen's kappa for 2 annotators, Fleiss' kappa for 3+). Below 0.4 is a warning sign that your rubric is ambiguous.
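For the two-annotator case, Cohen's kappa is small enough to compute by hand. The sketch below is a plain-Python version of the standard formula, (observed agreement minus chance agreement) divided by (1 minus chance agreement); the function name is an assumption for this example.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Computing kappa per rubric dimension, not only overall, shows which dimension's wording is producing the disagreement.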

Red-teaming is structured adversarial testing. A group explicitly tries to make your system fail. The output is specific failure cases that get added to your regression suite so those failure modes are permanently monitored. Effective red-teaming requires diversity in the team -- different backgrounds and attack strategies find different failure modes.

Shadow mode evaluation logs all production traffic and periodically samples a subset for human review. Stratified sampling is more useful than random sampling: oversample low-confidence retrievals, long responses, and recent queries to catch distribution shift early.
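A minimal sketch of stratified sampling for human review follows. The log fields (`retrieval_score`, `response`) and the cutoffs (0.5 and 300 words) are assumptions for illustration, and recency is left out to keep the example short. Drawing an equal quota from each stratum is what oversamples the rare ones.

```python
import random


def stratum_of(record: dict) -> str:
    """Assign a log record to a review stratum.
    Field names and cutoffs are illustrative assumptions."""
    if record["retrieval_score"] < 0.5:
        return "low_confidence"
    if len(record["response"].split()) > 300:
        return "long_response"
    return "other"


def stratified_sample(logs: list[dict], per_stratum: int,
                      seed: int = 0) -> list[dict]:
    """Draw up to `per_stratum` records from each stratum."""
    rng = random.Random(seed)  # seeded so a review batch is reproducible
    strata: dict[str, list[dict]] = {}
    for record in logs:
        strata.setdefault(stratum_of(record), []).append(record)
    sample = []
    for records in strata.values():
        sample.extend(rng.sample(records, min(per_stratum, len(records))))
    return sample
```

Because the sample is deliberately skewed, quality numbers computed from it should be reweighted by stratum size before being compared with overall production metrics.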

Start simple. Even a spreadsheet where team members tag 10 production responses per week as "good," "acceptable," or "needs improvement" with one-sentence explanations accumulates into a valuable dataset. When you build a proper annotation system, that backlog provides calibration data.

Interview Ready

How to Explain This in 2 Minutes

Elevator Pitch: LLM evaluation is fundamentally harder than traditional ML evaluation because there is no single correct answer for open-ended generation. A production-grade eval strategy layers three approaches: cheap automated metrics for every CI run, LLM-as-judge for nuanced quality at scale, and periodic human evaluation to calibrate everything else. For RAG systems, RAGAS decomposes evaluation into four independent metrics -- faithfulness, answer relevance, context precision, and context recall -- so you can pinpoint exactly which pipeline stage is failing. The teams that ship reliable LLM products treat evaluation as continuous infrastructure, not a one-time checklist.

Likely Interview Questions

Question | What They're Really Asking
How do you evaluate an LLM application in production? | Do you understand the layered eval stack?
How does LLM-as-judge work and what are its failure modes? | Can you identify and mitigate position, verbosity, and self-enhancement bias?
How would you detect hallucinations in a RAG system? | Can you connect faithfulness metrics and claim decomposition into a practical pipeline?
When would you use A/B testing versus offline evaluation? | Do you understand when each is warranted and how they complement each other?

Common Mistakes

  • Relying solely on BLEU/ROUGE for open-ended generation. These surface-level n-gram metrics penalize valid paraphrases and correlate poorly with human judgment for creative or reasoning tasks.
  • Using the same model as both generator and judge. Self-enhancement bias means GPT-4 systematically rates GPT-4-style outputs higher. Use a different model family for judging.
  • Evaluating only on development examples. This is the LLM equivalent of training on test data. A held-out set provides the only unbiased estimate of real-world performance.
