1. Architecture Overview
The Fine-Tuning & Serving architecture adapts a pre-trained foundation model to your specific domain, style, or task. Instead of relying solely on prompt engineering, you update the model's weights using your own curated dataset, then deploy the customized model behind a high-performance serving layer with A/B testing and drift monitoring.
When to Use
- Prompt engineering and few-shot examples have plateaued in quality
- You need consistent style, tone, or format that is hard to maintain via prompts alone
- Domain-specific terminology or knowledge requires weight-level adaptation
- You want to reduce inference cost by using a smaller, specialized model
- Latency requirements demand a smaller model that still meets quality thresholds
Decision Guide: Fine-Tune vs. Prompt Engineer
| Signal | Prompt Engineering | Fine-Tuning |
|---|---|---|
| Data available | < 50 examples | 500+ high-quality examples |
| Task complexity | Can be described in natural language | Requires pattern learning |
| Output consistency | Acceptable variation | Must follow rigid format |
| Iteration speed | Minutes (prompt edits) | Hours to days (training runs) |
| Cost at scale | Higher (long prompts) | Lower (shorter prompts, smaller model) |
| Maintenance | Version control prompts | Retrain on new data periodically |
Tip: Always start with prompt engineering. Only move to fine-tuning when you have strong evidence that prompts cannot achieve the required quality, and you have a robust evaluation pipeline to measure improvement.
2. Architecture Diagram

Figure: Fine-Tuning & Serving architecture, from data preparation through deployment, with a drift-driven retraining loop.
3. Components Deep Dive
| Component | Description |
|---|---|
| 🗃 Data Preparation | Convert raw data into JSONL instruction/response pairs. Apply quality filtering, deduplication, length balancing, and train/validation splitting. Data quality is the single largest factor in fine-tuning success. |
| ⚙ Fine-Tuning Methods | Choose between full fine-tuning (all weights), LoRA (low-rank adapters on attention layers), QLoRA (quantized LoRA for lower memory), or prefix tuning. LoRA is the default choice for most use cases. |
| 📊 Evaluation Pipeline | Combine automated metrics (loss, perplexity, BLEU/ROUGE) with held-out test sets and human evaluation. A model that scores well on metrics but fails human review is not ready for production. |
| 🚀 Serving Infrastructure | Deploy with optimized inference engines: vLLM (PagedAttention, continuous batching), TGI (HuggingFace), or Triton (NVIDIA). Each offers different trade-offs in throughput, latency, and hardware support. |
| ⚖ A/B Testing | Route traffic between model versions using weighted splits. Compare quality metrics, latency, and user satisfaction scores. Gradually ramp new models from 5% to 100% as confidence grows. |
| 📐 Model Registry | Version every model artifact with metadata: training config, dataset hash, evaluation scores, and lineage. Enables instant rollback and reproducibility. Use MLflow, Weights & Biases, or cloud-native registries. |
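To make the registry metadata described in the table concrete, here is a minimal sketch of a version record that ties an adapter to its dataset hash, training config, and evaluation scores. The function names, field names, and example values are hypothetical; this is not the schema of MLflow, Weights & Biases, or any cloud registry.

```python
import hashlib
import json


def dataset_sha256(jsonl_lines):
    """Hash dataset lines in order, so any change to the data yields a new hash."""
    digest = hashlib.sha256()
    for line in jsonl_lines:
        digest.update(line.strip().encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_registry_record(name, version, base_model, dataset_lines,
                          training_config, eval_scores, parent_version=None):
    """Assemble a metadata record (hypothetical schema) for one model artifact."""
    return {
        "name": name,
        "version": version,
        "base_model": base_model,
        "dataset_hash": dataset_sha256(dataset_lines),
        "training_config": training_config,
        "eval_scores": eval_scores,
        "parent_version": parent_version,  # lineage: the version this one follows
        "stage": "staging",                # promoted to "production" after the eval gate
    }


# Illustrative values only
record = build_registry_record(
    name="my-adapter",
    version="v2",
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    dataset_lines=['{"messages": []}'],
    training_config={"r": 16, "lora_alpha": 32, "learning_rate": 2e-4},
    eval_scores={"val_loss": 1.23},
    parent_version="v1",
)
print(json.dumps(record, indent=2))
```

Storing the dataset hash next to the scores is what makes a rollback reproducible: the same hash and config should identify the same training inputs.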
Fine-Tuning Methods Comparison
| Method | Trainable Params | GPU Memory | Quality | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | Very High (4x model size) | Highest | Unlimited compute, maximum quality |
| LoRA | 0.1 – 1% | Low (1.1x model size) | Near-full | Most production use cases |
| QLoRA | 0.1 – 1% | Very Low (0.5x) | Good | Limited GPU memory, prototyping |
| Prefix Tuning | < 0.1% | Minimal | Moderate | Simple style/format adaptation |
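The GPU memory multipliers in the table can be turned into a rough back-of-the-envelope estimate. The sketch below assumes that "model size" means the bf16 weight footprint (2 bytes per parameter) and ignores activations, sequence length, and batch size, so treat the results as order-of-magnitude guidance only. The function name is hypothetical.

```python
# Multipliers copied from the comparison table; "model size" is assumed to be
# the bf16 weight footprint (2 bytes per parameter).
MEMORY_MULTIPLIER = {
    "full": 4.0,
    "lora": 1.1,
    "qlora": 0.5,
}
BYTES_PER_PARAM_BF16 = 2


def estimate_training_memory_gb(num_params: float, method: str) -> float:
    """Rough training memory in GB: bf16 model size times the table multiplier."""
    model_size_gb = num_params * BYTES_PER_PARAM_BF16 / 1e9
    return model_size_gb * MEMORY_MULTIPLIER[method]


# Example: an 8B-parameter model
for method in MEMORY_MULTIPLIER:
    gb = estimate_training_memory_gb(8e9, method)
    print(f"{method:>5}: ~{gb:.1f} GB")
```

Under these assumptions an 8B model needs about 64 GB for full fine-tuning, about 17.6 GB for LoRA, and about 8 GB for QLoRA, which is why the adapter methods fit on a single GPU where full fine-tuning may not.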
Key Hyperparameters
| Parameter | Typical Range | Notes |
|---|---|---|
| Learning Rate | 1e-5 – 2e-4 | Lower for larger models; use cosine scheduler |
| Epochs | 1 – 5 | More epochs risk overfitting; monitor val loss |
| LoRA Rank (r) | 4 – 64 | Higher rank = more capacity but more params |
| LoRA Alpha | 16 – 128 | Commonly set to 2x rank; the adapter update is scaled by alpha / r |
| Batch Size | 4 – 32 | Use gradient accumulation if GPU-limited |
| Warmup Ratio | 0.03 – 0.1 | Gradual learning rate increase |
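To illustrate how rank and alpha interact: in standard LoRA, each adapted weight W of shape (d_out, d_in) gets two small matrices, A (r x d_in) and B (d_out x r), and the effective weight is W + (alpha / r) * B A. The sketch below counts the trainable parameters this adds and computes the scaling factor. The layer shapes are an assumption for an 8B Llama-style model with grouped-query attention (hidden size 4096, key/value projection size 1024, 32 layers); substitute the shapes of your own base model.

```python
def lora_trainable_params(rank, layer_shapes, num_layers):
    """Count LoRA parameters: each (d_out, d_in) weight adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)
    return per_layer * num_layers


def lora_scale(alpha, rank):
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


# Assumed (d_out, d_in) shapes for q_proj, k_proj, v_proj, o_proj
ATTENTION_SHAPES = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]

for r in (4, 16, 64):
    params = lora_trainable_params(r, ATTENTION_SHAPES, num_layers=32)
    print(f"r={r:<3} trainable={params:>12,} scale={lora_scale(2 * r, r)}")
```

With these shapes, r=16 gives about 13.6M trainable parameters, and keeping alpha at 2x rank holds the scale at 2.0 as the rank changes. That is one reason the "2x rank" convention is convenient: you can vary capacity without also changing the magnitude of the update.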
4. Implementation
Data Preparation Pipeline
```python
import hashlib
import json


def prepare_dataset(raw_path: str, output_path: str, max_len: int = 2048) -> int:
    """Convert raw JSONL records to chat-format JSONL, filtering and deduplicating.

    Lengths are measured in characters, not tokens.
    """
    seen_hashes = set()
    valid, skipped = 0, 0
    with open(raw_path, encoding="utf-8") as f_in, \
            open(output_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            if not line.strip():
                continue  # ignore blank lines instead of failing in json.loads
            row = json.loads(line)

            # Validate required fields
            if not row.get("instruction") or not row.get("response"):
                skipped += 1
                continue

            # Length filter (combined character count)
            total_len = len(row["instruction"]) + len(row["response"])
            if total_len > max_len or total_len < 20:
                skipped += 1
                continue

            # Deduplicate by content hash (MD5 is used only as a fingerprint)
            content_hash = hashlib.md5(
                (row["instruction"] + row["response"]).encode()
            ).hexdigest()
            if content_hash in seen_hashes:
                skipped += 1
                continue
            seen_hashes.add(content_hash)

            # Format as chat messages
            formatted = {
                "messages": [
                    {"role": "user", "content": row["instruction"]},
                    {"role": "assistant", "content": row["response"]},
                ]
            }
            f_out.write(json.dumps(formatted) + "\n")
            valid += 1

    print(f"Prepared {valid} examples, skipped {skipped}")
    return valid
```
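The pipeline above filters and deduplicates but does not split the data. One possible way to produce the 80/10/10 train/validation/test split is to assign each example to a split by hashing its content, which keeps the assignment stable across reruns and as new data is appended. The sketch below is one such approach, not a prescribed one; `assign_split` and `split_dataset` are hypothetical helpers.

```python
import hashlib
import json


def assign_split(example: dict, train_pct: int = 80, val_pct: int = 10) -> str:
    """Map an example to a split using a stable hash of its content."""
    key = json.dumps(example, sort_keys=True).encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "validation"
    return "test"


def split_dataset(examples):
    """Group examples into train/validation/test lists."""
    splits = {"train": [], "validation": [], "test": []}
    for example in examples:
        splits[assign_split(example)].append(example)
    return splits


# Synthetic examples, for illustration only
examples = [
    {"messages": [
        {"role": "user", "content": f"question {i}"},
        {"role": "assistant", "content": f"answer {i}"},
    ]}
    for i in range(1000)
]
splits = split_dataset(examples)
print({name: len(rows) for name, rows in splits.items()})
```

Because the split depends only on the example's content, an example never moves from the training set into the test set when the dataset grows, which helps keep held-out evaluations comparable between training runs. The proportions are approximate rather than exact.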
LoRA Training Configuration
```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load base model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank
    lora_alpha=32,     # scaling factor (2x rank)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on the four attention projections of this model, roughly
# 13.6M parameters are trainable (about 0.17% of the total).

# Training arguments
training_args = TrainingArguments(
    output_dir="./ft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32 per device
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    report_to="wandb",
)

# train_dataset and eval_dataset are assumed to be tokenized datasets
# built from the prepared JSONL files.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,  # renamed to processing_class in newer transformers releases
)
trainer.train()
```
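The training run reports a validation loss at every evaluation step, and that is the signal to watch for overfitting. As a minimal, framework-independent sketch, the helpers below convert a mean loss into perplexity and flag when validation loss has stopped improving. The function names and the patience rule are illustrative assumptions; Transformers also ships an `EarlyStoppingCallback` that may suit a real training loop better.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


def should_stop(val_losses, patience: int = 3, min_delta: float = 0.0) -> bool:
    """Return True when the last `patience` evaluations failed to beat the earlier best."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)


# Illustrative loss history: improves, then drifts upward
history = [2.10, 1.80, 1.60, 1.62, 1.65, 1.70]
print(f"perplexity at best checkpoint: {perplexity(min(history)):.2f}")
print(f"stop training: {should_stop(history)}")
```

In this made-up history the best validation loss is reached at the third evaluation and the next three are all worse, so the check recommends stopping and keeping the earlier checkpoint.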
vLLM Serving Setup
```python
# Launch vLLM server with LoRA adapter
# Command line:
#   python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules my-adapter=./ft-output/adapter \
#     --max-loras 4 \
#     --port 8000
import logging
import random

from openai import OpenAI

logger = logging.getLogger("ab_test")

# Client code (vLLM is OpenAI-compatible)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="na")


def query_fine_tuned(prompt: str, model: str = "my-adapter") -> str:
    """Query the fine-tuned model via vLLM."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.3,
    )
    return response.choices[0].message.content


def log_ab_result(prompt: str, model: str, result: str) -> None:
    """Placeholder logger (an assumption); replace with your analytics sink."""
    logger.info("model=%s prompt_chars=%d", model, len(prompt))


# A/B traffic routing
# Assumes adapters named my-adapter-v1 and my-adapter-v2 were both
# registered with --lora-modules when the server was launched.
def ab_route(prompt: str, new_model_pct: float = 0.1) -> str:
    """Route traffic between model versions."""
    model = "my-adapter-v2" if random.random() < new_model_pct else "my-adapter-v1"
    result = query_fine_tuned(prompt, model=model)
    # Log which model served the request for analysis
    log_ab_result(prompt, model, result)
    return result
```
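The router above picks a model at random for every request, so the same user can see both versions within one session. If you want sticky assignment, one common alternative is to hash a stable identifier into a bucket. The sketch below assumes a `user_id` is available per request; `assign_variant` and the `salt` parameter are hypothetical, and the adapter names match those used in the routing example.

```python
import hashlib


def assign_variant(user_id: str, new_model_pct: float, salt: str = "ab-exp-1") -> str:
    """Deterministically assign a user to a model version.

    The salt separates experiments, so a user's bucket in one test
    does not determine their bucket in the next.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "my-adapter-v2" if bucket < new_model_pct else "my-adapter-v1"


# Ramp from 5% to 50%: users already on v2 stay on v2
users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if assign_variant(u, 0.05) == "my-adapter-v2"}
at_50 = {u for u in users if assign_variant(u, 0.50) == "my-adapter-v2"}
print(len(at_5), len(at_50), at_5 <= at_50)
```

Because each user's bucket is fixed, increasing `new_model_pct` only adds users to the new version and never moves anyone back, which keeps per-user comparisons clean during a gradual ramp.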
5. Data Flow
Here is the step-by-step flow through the Fine-Tuning & Serving pipeline:

| Step | Action | Details |
|---|---|---|
| 1 | Collect training data | Gather instruction/response pairs from production logs, human annotators, or synthetic generation |
| 2 | Prepare & validate | Convert to JSONL, filter by quality/length, deduplicate, split into train/validation/test sets (80/10/10) |
| 3 | Configure training | Select base model, LoRA rank, learning rate, epochs; set up experiment tracking (W&B, MLflow) |
| 4 | Fine-tune | Train with LoRA adapters; monitor training/validation loss for convergence and overfitting |
| 5 | Evaluate | Run held-out test set, compute automated metrics, conduct human evaluation on 50-100 sample outputs |
| 6 | Register model | Push adapter weights + metadata (dataset version, scores, config) to model registry |
| 7 | Deploy with A/B split | Start new model at 5-10% traffic; compare quality and latency against production baseline |
| 8 | Monitor & iterate | Track quality drift, latency p99, error rates; trigger retraining when metrics degrade |
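Step 8 can be sketched as a small monitor that compares a rolling average of a quality score against the baseline recorded at deployment. This is a minimal illustration under stated assumptions: the `DriftMonitor` class, the tolerance, and the window size are hypothetical, and the score could be any metric where higher is better (for example, the pass rate on a held-out set).

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flag quality drift when the rolling mean falls below baseline - tolerance."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 5):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add a score; return True when the full window's mean has regressed."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        return mean(self.scores) < self.baseline - self.tolerance


# Illustrative scores from periodic evaluations
monitor = DriftMonitor(baseline=0.90, tolerance=0.05, window=3)
alerts = [monitor.record(s) for s in [0.91, 0.89, 0.88, 0.84, 0.80, 0.79]]
print(alerts)
```

Averaging over a window avoids triggering retraining on a single noisy evaluation. In this example the alert only fires once the three most recent scores average below 0.85.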
6. Trade-offs & Considerations
| Advantage | Limitation |
|---|---|
| Domain-specific quality improvements | Requires curated, high-quality training data |
| Reduced inference cost (smaller model, shorter prompts) | Training compute and GPU costs |
| Consistent output style and format | Risk of catastrophic forgetting on general tasks |
| Lower latency with smaller specialized models | Ongoing maintenance: retraining on new data |
| Intellectual property stays in your adapter weights | Harder to debug than prompt-based approaches |
Serving Infrastructure Comparison
| Engine | Strengths | Best For |
|---|---|---|
| vLLM | PagedAttention, continuous batching, multi-LoRA | High-throughput production serving |
| TGI (HuggingFace) | Easy setup, HF ecosystem integration | Quick deployment, HF model hub |
| Triton (NVIDIA) | Multi-framework, ensemble pipelines | Complex ML pipelines, NVIDIA GPUs |
When to upgrade: If you need multiple specialized models collaborating on complex tasks, move to Architecture 09 (Multi-Agent). If you need a full platform with model management, routing, and observability, see Architecture 10 (Production Platform).
7. Production Checklist
- Data quality pipeline: automated filtering, deduplication, and validation on every training run
- Dataset versioning with hash-based tracking (DVC, LakeFS, or cloud storage versioning)
- Experiment tracking: log all hyperparameters, metrics, and artifacts (W&B, MLflow)
- Evaluation gate: model must pass automated + human eval thresholds before deployment
- Model registry with rollback capability (tag: production, staging, deprecated)
- A/B testing framework with statistical significance checks before full rollout
- Serving infrastructure with auto-scaling, health checks, and graceful draining
- Quality drift monitoring: periodic evaluation on held-out set, alert on regression
- Cost tracking per training run and per-inference cost comparison across model versions
- Automated retraining pipeline triggered by drift alerts or scheduled cadence
- Security: model weights encrypted at rest, access-controlled registry, audit logs
- Disaster recovery: model artifacts backed up, serving can cold-start from registry
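For the statistical significance item in the checklist, one simple option is a two-proportion z-test on a binary success metric, such as the share of responses rated acceptable. The sketch below is an assumption about how such a check might look, not a required method; the function name and the counts are illustrative.

```python
import math


def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, nothing to test
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Illustrative counts: baseline (A) vs. new model (B)
z, p = two_proportion_z_test(success_a=800, total_a=1000,
                             success_b=850, total_b=1000)
print(f"z={z:.2f}, p={p:.4f}, significant at 0.05: {p < 0.05}")
```

The same five-point improvement measured on only 100 requests per arm would not reach significance, which is a practical reason to hold a new model at a small traffic share until enough requests have accumulated.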