Quick Reference 06

LLM Evaluation & Guardrails

Quick reference for LLM evaluation metrics, guardrail types, and safety patterns for input and output.

Evaluation Metrics Overview

Without rigorous evaluation, you are guessing whether your LLM system is improving or degrading. These metrics form the foundation of every quality measurement pipeline, from offline benchmarks to production monitoring.

Text Similarity Metrics

Metric | What It Measures | Range | Strengths | Weaknesses
BLEU | N-gram precision overlap | 0-1 | Fast, established | Misses semantics
ROUGE-L | Longest common subsequence | 0-1 | Good for summaries | Surface-level only
ROUGE-1/2 | Unigram/bigram overlap | 0-1 | Simple | No semantic understanding
METEOR | Alignment with synonyms + stems | 0-1 | Better than BLEU for short text | Slower
BERTScore | Cosine similarity of BERT embeddings | -1 to 1 (typically 0-1) | Semantic awareness | Compute-heavy
BLEURT | Learned metric (fine-tuned BERT) | Unbounded (checkpoint-dependent) | Strong correlation with human judgments | Needs a trained checkpoint
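
To make the "surface-level" weakness concrete, here is a minimal sketch of ROUGE-L computed from the longest common subsequence. It assumes whitespace tokenization and a plain F1 combination of precision and recall; the function names are illustrative, and published ROUGE implementations add stemming and other options not shown here.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L style F1 over lowercased, whitespace-split tokens (simplified)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, "the film was great" scored against "the movie was excellent" shares only two tokens in order, so it scores 0.5 even though the meaning is nearly identical. That gap is what BERTScore and LLM-as-judge are meant to cover.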

LLM-as-Judge

Approach | Description | Pros | Cons
Pointwise scoring | Rate output 1-5 on criteria | Simple | Scale inconsistency
Pairwise comparison | A vs B, which is better? | More reliable | 2x cost, position bias
Reference-based | Compare to gold answer | Grounded | Needs reference answers
Rubric-based | Score against detailed rubric | Consistent | Rubric design takes effort
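
One way to handle the position bias noted for pairwise comparison is to judge both presentation orders and only accept a verdict that survives the swap. The sketch below assumes a hypothetical `judge` callable that returns "first" or "second"; in practice that callable would wrap an LLM call.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def pairwise_compare(question: str, answer_a: str, answer_b: str, judge: Judge) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    first_pass = judge(question, answer_a, answer_b)   # A is shown first
    second_pass = judge(question, answer_b, answer_a)  # B is shown first
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    a_wins = (first_pass == "first") + (second_pass == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the verdict flipped with the order, which suggests position bias
```

A judge that always prefers whichever answer it sees first produces "tie" here instead of a false win, at the price of the 2x cost listed in the table.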

LLM-as-Judge Prompt Template

You are an expert evaluator. Rate the following response on a scale of 1-5
for each criterion.

Criteria:
- Relevance (1-5): Does the response address the question?
- Accuracy (1-5): Are the facts correct?
- Completeness (1-5): Does it cover all important aspects?
- Clarity (1-5): Is it well-written and easy to understand?

Question: {question}
Response: {response}
Reference Answer: {reference}

Provide scores as JSON:
{"relevance": N, "accuracy": N, "completeness": N, "clarity": N, "reasoning": "..."}
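
Judge models do not always return clean JSON, so the scores are worth validating before they enter a metrics pipeline. The following is a minimal sketch that extracts the JSON object from the judge's reply and checks each criterion against the 1-5 scale in the template above; `parse_judge_scores` and the added `mean` field are illustrative, not part of any library.

```python
import json

CRITERIA = ("relevance", "accuracy", "completeness", "clarity")


def parse_judge_scores(raw: str) -> dict:
    """Extract and validate the score object from a judge model's reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError subclass
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
        scores[name] = value
    scores["mean"] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    scores["reasoning"] = str(data.get("reasoning", ""))
    return scores
```

Treating a malformed reply as an error, and retrying or logging it, is usually safer than silently recording a default score.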

RAG-Specific Metrics

Metric | What It Measures | How to Compute
Faithfulness | Is the answer grounded in context? | LLM checks each claim against retrieved docs
Answer relevancy | Does the answer address the question? | Semantic similarity (answer, question)
Context relevancy | Are retrieved docs relevant? | LLM rates relevance of each chunk
Context recall | Did we retrieve enough? | Check whether claims in the reference answer are supported by the retrieved context
Answer correctness | Is the final answer right? | Compare to ground truth
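
Faithfulness is commonly reported as the fraction of claims in the answer that the retrieved context supports. The sketch below assumes the answer has already been split into claims and takes a hypothetical `is_supported` verifier; the keyword-overlap verifier is only a toy stand-in for the LLM check described in the table.

```python
from typing import Callable


def faithfulness_score(claims: list[str], context: str,
                       is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of claims judged as supported by the context."""
    if not claims:
        return 1.0  # assumption: an answer with no claims has nothing unsupported
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def keyword_overlap_supported(claim: str, context: str) -> bool:
    """Toy verifier: every word of the claim must appear in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())
```

For example, with the context "paris is the capital of france", the claims "paris is the capital" and "paris has 10 million people" give a score of 0.5, because only the first claim is covered.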

Evaluation Frameworks

Do not build your evaluation infrastructure from scratch -- these frameworks handle the boilerplate of running test suites, scoring outputs, and comparing experiments so you can focus on defining what "good" means for your use case.

Framework | Key Feature | Language
RAGAS | RAG-specific metrics (faithfulness, relevancy) | Python
DeepEval | LLM-as-judge with many metrics | Python
Promptfoo | Config-driven eval, CI/CD friendly | Node.js
LangSmith | Tracing + evaluation in LangChain | Python/TS
Braintrust | Logging, scoring, experiments | Python/TS
Arize Phoenix | Tracing, evals, embeddings viz | Python

Guardrail Types

Guardrails are the safety net between your LLM and your users. Each type trades off speed, cost, and accuracy -- production systems layer multiple types together for defense in depth.

Type | Speed | Cost | Accuracy | Use Case
Rule-based (regex, keyword) | Very fast | Free | Low-medium | Known bad patterns
Classifier (ML model) | Fast | Low | Medium-high | Topic/toxicity detection
LLM-based | Slow | High | High | Nuanced judgment
NER + PII detection | Fast | Low | High for PII | Data privacy
Embedding similarity | Fast | Low | Medium | Off-topic detection
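
The embedding similarity row can be sketched as a nearest-topic check: embed the query, compare it to embeddings of allowed topics, and flag it when the best match falls below a threshold. The vectors below are toy values and the 0.5 threshold is an assumption; real embeddings would come from whichever embedding model you use, and the threshold needs tuning on labeled data.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_off_topic(query_vec: Sequence[float],
                 topic_vecs: Sequence[Sequence[float]],
                 threshold: float = 0.5) -> bool:
    """Flag the query when its best match among allowed topics is below the threshold.

    topic_vecs must be non-empty.
    """
    best = max(cosine_similarity(query_vec, t) for t in topic_vecs)
    return best < threshold
```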

Input Guardrails

Guardrail | Detects | Implementation
Prompt injection | Attempts to override system prompt | Classifier + LLM check
Jailbreak | Bypass attempts | Pattern matching + classifier
PII detection | SSN, email, phone, names | Regex + NER model (Presidio)
Topic restriction | Off-topic requests | Classifier or embedding distance
Input length | Extremely long inputs | Token counter + hard limit
Language detection | Unsupported languages | Language ID model
Rate limiting | Abuse, DDoS | Token bucket per user
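
The "token bucket per user" entry can be sketched as follows. Each user gets a bucket that refills at a steady rate up to a fixed capacity, and a request is allowed only if a token is available. The capacity of 5 and refill rate of 1 token per second are illustrative assumptions, as is the in-memory `buckets` dict; a production deployment would typically keep this state in a shared store.

```python
import time
from typing import Optional


class TokenBucket:
    """Token bucket that refills continuously up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_sec: float, now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def allow_request(user_id: str, now: Optional[float] = None) -> bool:
    """Check the per-user bucket, creating it on first use."""
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(capacity=5, refill_per_sec=1.0, now=now)
    return buckets[user_id].allow(now=now)
```

The optional `now` parameter exists only so the behavior can be tested with a fake clock; callers normally omit it.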

Output Guardrails

Guardrail | Detects | Implementation
Hallucination check | Unsupported claims | Compare output vs source docs
Toxicity filter | Harmful content | Classifier (Perspective API, etc.)
PII leakage | Model outputs PII | Regex + NER on output
Format validation | Wrong output structure | JSON schema validation
Factual grounding | Unverifiable claims | Citation check, source matching
Brand safety | Off-brand content | Keyword list + classifier
Code safety | Dangerous code patterns | AST analysis, sandbox
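
For the code safety row, a static pass over generated Python can be sketched with the standard `ast` module. The lists of dangerous calls and modules below are illustrative assumptions, not a complete policy, and static analysis like this is easy to evade, which is why the table pairs it with a sandbox.

```python
import ast

# Illustrative, deliberately short lists; a real policy would be broader.
DANGEROUS_CALLS = {"eval", "exec", "compile", "__import__"}
DANGEROUS_MODULES = {"os", "subprocess", "shutil", "socket"}


def find_dangerous_patterns(source: str) -> list[str]:
    """Return human-readable findings for risky calls and imports in Python source."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in DANGEROUS_CALLS:
                findings.append(f"call to {node.func.id}() on line {node.lineno}")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DANGEROUS_MODULES:
                    findings.append(f"import of {alias.name} on line {node.lineno}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in DANGEROUS_MODULES:
                findings.append(f"import from {node.module} on line {node.lineno}")
    return findings
```

An empty list means nothing on the lists was found, not that the code is safe to run.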

Guardrail Implementation Patterns

A single guardrail layer will always have blind spots. The layered defense pattern catches cheap, obvious problems first and reserves expensive LLM-based checks for what slips through.

Layered Defense

Input
  │
  ├── Layer 1: Rule-based (fast, cheap)
  │   ├── Length check
  │   ├── Keyword blocklist
  │   └── Regex patterns
  │
  ├── Layer 2: Classifier (fast, moderate cost)
  │   ├── Toxicity classifier
  │   ├── Topic classifier
  │   └── PII NER
  │
  ├── Layer 3: LLM-based (slow, expensive)
  │   ├── Injection detection
  │   └── Nuanced policy check
  │
  ▼
LLM Call
  │
  ▼
Output
  │
  ├── Layer 1: Format validation (fast)
  ├── Layer 2: PII/toxicity scan (fast)
  └── Layer 3: Faithfulness check (if RAG)

Prompt Injection Detection

import re

# Simple pattern-based check: a cheap first layer that is easy to evade by rephrasing
INJECTION_PATTERNS = [
    r"ignore (?:all )?(?:previous|above|prior) instructions",
    r"you are now",
    r"new instructions:",
    r"system prompt:",
    r"forget (?:everything|your instructions)",
    r"disregard (?:all|your|the)",
]

def check_injection(text: str) -> bool:
    """Return True if the text matches any known injection phrase."""
    text_lower = text.lower()
    return any(re.search(p, text_lower) for p in INJECTION_PATTERNS)

PII Detection with Presidio

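from presidio_analyzer import AnalyzerEngine

# Requires presidio-analyzer and a spaCy English model to be installed
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text="My SSN is 123-45-6789 and email is john@example.com",
    language="en",
    entities=["PHONE_NUMBER", "EMAIL_ADDRESS", "US_SSN", "PERSON"]
)
# Returns a list of RecognizerResult objects with entity_type, start, end, and score.
# Which entities match, and their scores, depend on the Presidio version and each
# recognizer's validation rules, so inspect the results instead of assuming them.
for r in results:
    print(r.entity_type, r.start, r.end, r.score)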

Output Validation

import json
from jsonschema import validate, ValidationError

expected_schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "sources": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["answer", "confidence"]
}

def validate_output(llm_output: str) -> dict:
    try:
        parsed = json.loads(llm_output)
        validate(instance=parsed, schema=expected_schema)
        return {"valid": True, "data": parsed}
    except (json.JSONDecodeError, ValidationError) as e:
        return {"valid": False, "error": str(e)}

Guardrail Frameworks

These frameworks provide pre-built validators and scanners so you do not have to implement injection detection, PII scanning, and toxicity filtering from scratch.

Framework | Type | Key Features
Guardrails AI | Python SDK | Validators, structured output, retry
NeMo Guardrails | NVIDIA | Dialog rails, topic control, Colang
LLM Guard | Open source | Input/output scanners, PII, toxicity
Lakera Guard | API | Prompt injection, PII, content
Rebuff | Open source | Multi-layer injection detection

Evaluation Pipeline Design

Evaluation is not a one-time event -- it is a continuous pipeline that runs offline before deployment and online after. Without both, you are either shipping untested changes or ignoring production degradation.

Offline Evaluation

1. Curate test dataset (100+ examples with ground truth)
2. Run model on test set
3. Compute automatic metrics (BLEU, ROUGE, BERTScore)
4. Run LLM-as-judge evaluation
5. Compute pass rates for guardrails
6. Human eval on sample (20-50 examples)
7. Compare against baseline
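
Step 7 is the one that turns the scores into a release decision. A minimal sketch of a baseline gate follows, assuming every metric is "higher is better" and that a small tolerance absorbs run-to-run noise; the metric names and the 0.01 tolerance are illustrative.

```python
def compare_to_baseline(candidate: dict[str, float],
                        baseline: dict[str, float],
                        tolerance: float = 0.01) -> dict:
    """Flag every baseline metric that dropped by more than the tolerance.

    Assumes higher is better for all metrics. A metric missing from the
    candidate run counts as a regression.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = {"baseline": base_value, "candidate": new_value}
    return {"passed": not regressions, "regressions": regressions}
```

Run in CI, a failed gate blocks the change; metrics where lower is better (latency, error rate) would need the comparison inverted.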

Online Monitoring

Metric | What to Track | Alert Threshold
Guardrail trigger rate | % of requests blocked | Sudden spike (> 2x baseline)
User feedback | Thumbs up/down ratio | Below 80% positive
Latency | P50, P95, P99 | P95 > 5s
Token usage | Avg tokens per request | > 2x expected
Error rate | % of failed requests | > 1%
Hallucination rate | % of ungrounded claims | > 10% (sample-based)
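
The thresholds in the table translate directly into a periodic check. In this sketch the dictionary keys are hypothetical names for whatever your metrics store exposes, rates are expressed as fractions between 0 and 1, and the thresholds are the ones listed above.

```python
def check_alerts(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return an alert message for every threshold from the table that is breached."""
    alerts = []
    if current["guardrail_trigger_rate"] > 2 * baseline["guardrail_trigger_rate"]:
        alerts.append("guardrail trigger rate above 2x baseline")
    if current["positive_feedback_ratio"] < 0.80:
        alerts.append("positive feedback below 80%")
    if current["latency_p95_s"] > 5.0:
        alerts.append("P95 latency above 5s")
    if current["avg_tokens"] > 2 * baseline["avg_tokens"]:
        alerts.append("token usage above 2x expected")
    if current["error_rate"] > 0.01:
        alerts.append("error rate above 1%")
    if current["hallucination_rate"] > 0.10:
        alerts.append("sampled hallucination rate above 10%")
    return alerts
```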

Common Pitfalls

The most dangerous pitfall is having no evaluation at all -- followed closely by evaluating the wrong thing. This list covers the mistakes that lead to false confidence in your LLM system.

Pitfall | Problem | Fix
Only using BLEU/ROUGE | Misses semantic quality | Add BERTScore + LLM-as-judge
No eval dataset | Can't measure improvements | Create before building features
Guardrails too strict | High false positive rate, bad UX | Tune thresholds, add fallback responses
Guardrails too loose | Harmful content leaks through | Layer multiple methods
Only testing happy path | Fails on adversarial input | Red-team with injection attempts
No monitoring in prod | Silent degradation | Log guardrail triggers, sample outputs
LLM-as-judge position bias | Prefers first/last option | Randomize order, average across positions
Eval on training data | Misleading results | Strict train/eval split
One-time eval | Model or data drifts | Automate eval in CI/CD

Quick Decision: Which Metric?

Different tasks demand different metrics -- using BLEU for open-ended QA or human eval for classification wastes time and gives misleading results. Match the metric to the task.

Task | Primary Metric | Secondary
Summarization | ROUGE-L | BERTScore, human eval
Translation | BLEU | COMET, human eval
Classification | Accuracy, F1 | Precision, Recall
Open QA | LLM-as-judge | BERTScore vs reference
RAG | Faithfulness, Answer Relevancy | Context Relevancy, Recall
Code generation | pass@k (execution) | CodeBLEU
Chat/dialog | Human eval, LLM-as-judge | User satisfaction metrics
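
Of the metrics in the table, pass@k is the least self-explanatory. A commonly used unbiased estimator generates n samples per problem, counts the c samples that pass the tests, and computes 1 - C(n-c, k) / C(n, k), the probability that a random draw of k samples contains at least one passing sample. A minimal sketch for a single problem:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k, so one sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value across all problems.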