Quick Reference 05

Fine-Tuning LLMs

Quick reference for when to fine-tune, data prep, LoRA/QLoRA, training parameters, and evaluation.


When to Fine-Tune vs Alternatives

Fine-tuning is expensive and slow compared to prompting or RAG, but when you genuinely need to change a model's behavior, style, or domain expertise, it is often the only approach that works reliably. Use this table to avoid fine-tuning when a simpler approach will do.

| Approach | Best When | Cost | Effort | Latency Impact |
|---|---|---|---|---|
| Prompt engineering | Task is well-defined, few formats | Very low | Low | None |
| Few-shot examples | Need consistent format/style | Low | Low | Slightly higher (more tokens) |
| RAG | Need up-to-date or private knowledge | Medium | Medium | Higher (retrieval step) |
| Fine-tuning | Need behavior change, style, or domain expertise | High | High | Lower at inference |
| Pre-training (continued) | Entirely new domain/language | Very high | Very high | None |

Decision Tree

Can prompt engineering solve it?
  YES -> Stop, use prompts
  NO  -> Is the issue missing knowledge?
           YES -> Use RAG
           NO  -> Is it about style, format, or behavior?
                    YES -> Fine-tune
                    NO  -> Is the base model too slow/expensive?
                             YES -> Fine-tune smaller model
                             NO  -> Revisit prompt engineering
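
The tree above can also be written as a small function, which is handy when the decision is part of a project checklist or intake script. This is only a sketch: the function name and the four boolean inputs are hypothetical, and answering them honestly is still the hard part.

```python
def choose_approach(prompting_works: bool,
                    missing_knowledge: bool,
                    style_or_behavior: bool,
                    base_too_costly: bool) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if prompting_works:
        return "use prompts"
    if missing_knowledge:
        return "use RAG"
    if style_or_behavior:
        return "fine-tune"
    if base_too_costly:
        return "fine-tune smaller model"
    return "revisit prompt engineering"


# Example: prompts alone fall short, and the gap is tone rather than facts.
print(choose_approach(False, False, True, False))
```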

Data Preparation

Data quality is the single biggest predictor of fine-tuning success: a model trained on 200 clean, diverse examples will often outperform one trained on 2,000 noisy ones. Get the format and quality right before you spend a dollar on compute.

JSONL Format (OpenAI Chat)

{"messages": [{"role": "system", "content": "You are a legal assistant."}, {"role": "user", "content": "Summarize this clause..."}, {"role": "assistant", "content": "This clause states that..."}]}
{"messages": [{"role": "system", "content": "You are a legal assistant."}, {"role": "user", "content": "Is this enforceable?"}, {"role": "assistant", "content": "Based on the terms..."}]}

JSONL Format (Anthropic)

{"messages": [{"role": "user", "content": "Summarize this clause..."}, {"role": "assistant", "content": "This clause states that..."}]}

JSONL Format (Alpaca-style, for open models)

{"instruction": "Summarize the following clause", "input": "The party of the first part...", "output": "This clause establishes that..."}
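
Datasets often arrive in one of these formats and need to be converted to another. The sketch below maps an Alpaca-style record onto the chat `messages` layout shown earlier. The function name `alpaca_to_chat` and the choice to join `instruction` and `input` with a blank line are assumptions for illustration, not a standard.

```python
import json
from typing import Optional


def alpaca_to_chat(record: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one Alpaca-style record into a chat-style {"messages": [...]} record."""
    user_content = record["instruction"]
    if record.get("input"):
        # Assumption: append the optional input after a blank line.
        user_content += "\n\n" + record["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}


example = {
    "instruction": "Summarize the following clause",
    "input": "The party of the first part...",
    "output": "This clause establishes that...",
}
# One JSON object per line is what makes the file JSONL.
print(json.dumps(alpaca_to_chat(example, system_prompt="You are a legal assistant.")))
```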

Data Quality Checklist

| Check | Why | How |
|---|---|---|
| Minimum 50-100 examples | Models need enough signal | Count rows |
| Consistent format | Reduces noise | Validate JSON schema |
| Diverse inputs | Prevents overfitting | Cluster by topic, check distribution |
| Clean outputs | Garbage in = garbage out | Human review sample |
| No contradictions | Confuses the model | Cross-check examples |
| Balanced classes | Prevents bias toward majority | Check class distribution |
| Deduplication | Overfitting on duplicates | Hash-based dedup |
| PII removed | Privacy compliance | Run PII scanner |

| Task Type | Minimum examples | Good | Excellent |
|---|---|---|---|
| Classification | 50 | 500 | 5,000+ |
| Structured extraction | 100 | 1,000 | 10,000+ |
| Style/tone transfer | 200 | 1,000 | 5,000+ |
| Domain expertise | 500 | 5,000 | 50,000+ |
| Code generation | 1,000 | 10,000 | 100,000+ |
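
Two of the checklist items, consistent format and deduplication, are easy to automate. The following is a minimal sketch for chat-style JSONL using only the standard library. The function names and the specific rules (allowed roles, last message from the assistant) are assumptions chosen for illustration. Adjust them to whatever your provider or training framework actually requires.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Tuple

VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_record(record: Dict) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


def clean_jsonl(lines: Iterable[str]) -> Tuple[List[Dict], List[Tuple[int, List[str]]]]:
    """Validate each line and drop exact duplicates via a SHA-256 hash of the record."""
    seen, clean, rejected = set(), [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append((line_no, [f"invalid JSON: {exc.msg}"]))
            continue
        if not isinstance(record, dict):
            rejected.append((line_no, ["top-level value is not an object"]))
            continue
        problems = validate_chat_record(record)
        if problems:
            rejected.append((line_no, problems))
            continue
        digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            clean.append(record)
    return clean, rejected
```

Because iterating over an open file yields lines, `clean_jsonl(open("training.jsonl", encoding="utf-8"))` works directly. Hash-based dedup only catches exact duplicates, so near-duplicates still need a separate pass.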

Fine-Tuning Methods

Full fine-tuning is rarely necessary: LoRA and QLoRA typically recover most of full fine-tuning quality (90-95% is a commonly cited range, though it varies by task) at a fraction of the memory and cost. Choose your method based on your GPU budget and quality requirements.

| Method | Parameters Trained | Memory Required | Training Speed | Quality |
|---|---|---|---|---|
| Full fine-tuning | All | Very high (4x model) | Slow | Highest |
| LoRA | 0.1-1% of params | Low (model + adapters) | Fast | High |
| QLoRA | 0.1-1% of params | Very low (4-bit base) | Fast | High |
| Prefix tuning | Prefix embeddings only | Very low | Very fast | Medium |
| Prompt tuning | Soft prompt tokens | Minimal | Fastest | Lower |
| RLHF | Reward model + policy | Very high | Very slow | Highest for alignment |
| DPO | Policy only (no reward model) | High | Medium | High for alignment |

LoRA Key Concepts

Original weight matrix: W (d x d)
LoRA decomposition:    W' = W + BA
  where B: (d x r), A: (r x d), r << d

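- r (rank): Typically 4-64. Higher = more capacity, more memory
- alpha: Scaling factor applied to the update BA (as alpha / r in common implementations), typically set to 2x rank
- Target modules: q_proj, v_proj (minimum); adding k_proj, o_proj, gate_proj, up_proj, down_proj covers all linear layers in Llama-style models (full)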
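
To see why a small rank saves so much, it helps to count parameters for a single weight matrix. The sketch below works through the arithmetic for the decomposition above. The helper names are hypothetical, and the `alpha / r` scaling reflects how common implementations such as PEFT apply alpha by default, which is an assumption relative to the simplified formula shown here.

```python
def lora_param_counts(d: int, r: int):
    """Compare trainable parameters for a full d x d update vs. a rank-r LoRA update."""
    full = d * d                 # updating W directly
    lora = d * r + r * d         # B is (d x r), A is (r x d)
    return full, lora, lora / full


def lora_scale(alpha: float, r: int) -> float:
    """Scaling applied to BA in common implementations: W' = W + (alpha / r) * BA."""
    return alpha / r


full, lora, ratio = lora_param_counts(d=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
print("scale:", lora_scale(alpha=32, r=16))
```

For d = 4096 and r = 16, the adapter trains 131,072 parameters instead of 16,777,216, about 0.78% of the matrix. With alpha = 2r the scale factor is 2.0.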

LoRA Configuration Example

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # scaling (usually 2*r)
    target_modules=[               # which layers to adapt (names match Llama-style models)
        "q_proj", "v_proj",
        "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# base_model is assumed to be a causal LM that was loaded beforehand
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable: ~0.5% of total parameters

QLoRA Setup

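import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# Then apply LoRA on top of the 4-bit model
# (peft's prepare_model_for_kbit_training is commonly called first)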

Training Parameters

Wrong hyperparameters waste compute and produce bad models. Learning rate and epoch count are the two most impactful settings -- get those right first, then tune the rest.

| Parameter | Typical Range | Notes |
|---|---|---|
| Learning rate | 1e-5 to 5e-4 | Lower for larger models |
| Batch size | 4-32 | Limited by GPU memory; use gradient accumulation |
| Epochs | 1-5 | More data = fewer epochs needed |
| Warmup steps | 5-10% of total steps | Stabilizes early training |
| Weight decay | 0.01-0.1 | Regularization |
| Max sequence length | 512-4096 | Match your data; padding wastes compute |
| Gradient accumulation | 2-16 | Simulates larger batch size |
| LR scheduler | Cosine or linear decay | Cosine usually better |
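
Several of these settings interact: batch size and gradient accumulation determine the effective batch size, which in turn fixes the number of optimizer steps and therefore the warmup length. The sketch below works through that arithmetic. The function name and the rounding choices are assumptions, and a real trainer may round slightly differently.

```python
import math


def training_schedule(num_examples: int, per_device_batch: int, grad_accum: int,
                      epochs: int, warmup_ratio: float, num_devices: int = 1) -> dict:
    """Approximate optimizer-step counts implied by the batch and epoch settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# 1,000 examples, batch 4 with 4 accumulation steps, 3 epochs, 5% warmup
print(training_schedule(1000, per_device_batch=4, grad_accum=4, epochs=3, warmup_ratio=0.05))
```

With these numbers the run has only 189 optimizer steps in total, which is a useful reminder that small datasets leave little room for long warmup or frequent evaluation.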

Training with Hugging Face

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size per device: 4 * 4 = 16
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",     # renamed to eval_strategy in newer transformers releases
    bf16=True,
    gradient_checkpointing=True,     # trades extra compute for lower activation memory
)

# model, train_data, and eval_data are assumed to be defined earlier
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    args=training_args,
    max_seq_length=2048,             # newer TRL releases expect this in SFTConfig instead
)
trainer.train()

GPU Memory Estimation

Running out of GPU memory mid-training is the most common fine-tuning blocker. Use this table to right-size your hardware before you start, and pick QLoRA when your GPU budget is tight.

| Model Size | Full FT | LoRA (r=16) | QLoRA (4-bit) |
|---|---|---|---|
| 1B | ~8 GB | ~5 GB | ~3 GB |
| 7B | ~56 GB | ~18 GB | ~8 GB |
| 13B | ~104 GB | ~32 GB | ~14 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |

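Rule of thumb: Full fine-tuning needs ~4x the 16-bit model size in memory (model + gradients + optimizer states), e.g. 7B parameters x 2 bytes x 4 = ~56 GB. Treat this as a lower bound: activations and fp32 optimizer states (as in mixed-precision Adam) can push the real figure higher.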
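
The rule of thumb is simple enough to script when sizing hardware for a model that is not in the table. This sketch reproduces the Full FT column and, for comparison, the size of 4-bit base weights on their own. The function names are hypothetical, and the 2 bytes per parameter default assumes 16-bit weights.

```python
def full_ft_memory_gb(params_billions: float, bytes_per_param: float = 2.0,
                      multiplier: float = 4.0) -> float:
    """Rule-of-thumb memory for full fine-tuning: ~4x the 16-bit model size."""
    return params_billions * bytes_per_param * multiplier


def quantized_weights_gb(params_billions: float, bits: int = 4) -> float:
    """Size of the base weights alone at the given bit width (no adapters, no activations)."""
    return params_billions * bits / 8


for size in (1, 7, 13, 70):
    print(f"{size}B: full FT ~{full_ft_memory_gb(size):.0f} GB, "
          f"4-bit weights ~{quantized_weights_gb(size):.1f} GB")
```

The QLoRA column in the table is higher than the 4-bit weight size because it also has to hold adapters, their optimizer states, and activations.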

OpenAI Fine-Tuning (API)

If you do not need to self-host, API-based fine-tuning is the fastest path to a custom model. You upload data, kick off a job, and get a model endpoint back -- no GPU management required.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload training file
file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune"
)

# 2. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": "auto",
        "learning_rate_multiplier": "auto"
    }
)

# 3. Check status (poll until the job reports "succeeded")
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status)

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:org:name:id",
    messages=[...]
)

Evaluation

A fine-tuned model that looks great on training loss can be catastrophically overfit. Always evaluate on held-out data using task-specific metrics, and compare against the base model to confirm the fine-tuning actually helped.

| Metric | What It Measures | When to Use |
|---|---|---|
| Loss (train/val) | Model fitting | Always (check for overfitting) |
| Exact match | Perfect output match | Classification, extraction |
| BLEU / ROUGE | Text overlap | Summarization, translation |
| BERTScore | Semantic similarity | Open-ended generation |
| Human eval (side-by-side) | Overall quality | Before production deployment |
| Task-specific accuracy | Domain performance | Always |
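
Exact match is the simplest of these metrics to compute yourself, and running it on both the base and the fine-tuned model gives the comparison recommended above. The sketch below is one possible implementation. The normalization (lowercasing and collapsing whitespace) is an assumption, and the sample predictions are made up purely to show the call pattern.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative held-out labels and outputs (not real results)
references = ["approve", "reject", "approve", "escalate"]
base_outputs = ["Approve", "approve", "reject", "escalate"]
tuned_outputs = ["approve", "reject", "approve", "reject"]

print("base: ", exact_match_rate(base_outputs, references))
print("tuned:", exact_match_rate(tuned_outputs, references))
```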

Overfitting Signals

| Signal | What It Looks Like | Fix |
|---|---|---|
| Train loss drops, val loss rises | Classic overfit curve | Reduce epochs, add data |
| Val loss flat, train loss dropping | Memorization starting | Early stopping |
| Perfect train accuracy | Model memorized data | More data, regularization |
| Worse than base model on general tasks | Catastrophic forgetting | Lower LR, fewer epochs, LoRA |
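
The first two signals can be detected automatically from logged validation losses. Below is a minimal patience-based early-stopping check. The function name, the default patience of 2, and the `min_delta` threshold are assumptions to tune for your own runs.

```python
from typing import Sequence


def should_stop_early(val_losses: Sequence[float], patience: int = 2,
                      min_delta: float = 0.0) -> bool:
    """Return True once validation loss fails to improve for `patience` evaluations in a row."""
    best = float("inf")
    stale = 0
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss      # new best: reset the counter
            stale = 0
        else:
            stale += 1       # flat or rising validation loss
            if stale >= patience:
                return True
    return False


print(should_stop_early([1.00, 0.80, 0.70, 0.72, 0.75]))  # val loss turned upward
print(should_stop_early([1.00, 0.90, 0.80]))              # still improving
```

When training with the Hugging Face `Trainer`, the built-in `EarlyStoppingCallback` serves the same purpose, so a hand-rolled check like this is mainly useful for custom loops or for analyzing logs after the fact.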

Cost Comparison

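Fine-tuning costs vary by orders of magnitude depending on whether you use an API provider or self-host. Factor in both training cost and ongoing inference cost when making your decision. Treat the prices below as a snapshot: providers change them often, so check current pricing before budgeting.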

| Provider | Model | Training Cost | Inference Cost |
|---|---|---|---|
| OpenAI | GPT-4o-mini FT | $3.00/1M train tokens | $0.60/$2.40 per 1M in/out |
| OpenAI | GPT-4o FT | $25.00/1M train tokens | $3.75/$15.00 per 1M in/out |
| Together.ai | Llama 8B | ~$0.50/hr (1x A100) | $0.20/1M tokens |
| Self-hosted | Any open model | GPU cost only | GPU cost only |
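
For token-priced APIs, a quick estimate before launching a job avoids surprises. The sketch below uses the per-token prices from the table as inputs. It assumes that billed training tokens equal the tokens in the file multiplied by the number of epochs, which you should confirm against the provider's billing documentation. The function names are hypothetical.

```python
def training_cost_usd(tokens_in_file: int, epochs: int, price_per_million: float) -> float:
    """Estimated training cost, assuming billed tokens = file tokens x epochs."""
    return tokens_in_file * epochs / 1_000_000 * price_per_million


def inference_cost_usd(input_tokens: int, output_tokens: int,
                       input_price_per_million: float,
                       output_price_per_million: float) -> float:
    """Estimated inference cost for a given volume of input and output tokens."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# 2M-token training file, 3 epochs, at the table's $3.00 per 1M training tokens
print(f"training:  ${training_cost_usd(2_000_000, 3, 3.00):.2f}")
# 1M input and 200K output tokens at the table's $0.60 / $2.40 per 1M
print(f"inference: ${inference_cost_usd(1_000_000, 200_000, 0.60, 2.40):.2f}")
```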

Common Pitfalls

The most expensive mistake is fine-tuning when you should have used RAG, and the second most expensive is overfitting because you skipped the eval set. This list covers both and everything in between.

| Pitfall | Problem | Fix |
|---|---|---|
| Fine-tuning for knowledge | LLMs forget; hallucination risk | Use RAG for knowledge, fine-tune for behavior |
| Too few examples | Underfitting, poor generalization | Start with 200+ high-quality examples |
| Too many epochs | Overfitting | Monitor val loss, use early stopping |
| Inconsistent data format | Confuses the model | Standardize all examples |
| No eval set | Can't detect overfitting | Hold out 10-20% of data |
| Ignoring base model capabilities | Redundant training | Test base model + prompt first |
| Not merging LoRA for production | Extra inference overhead | Merge adapters into base weights |
| Training on generated data only | Model amplifies errors | Include human-validated examples |
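
Holding out an eval set takes only a few lines, so there is little excuse for skipping it. Here is a minimal sketch of a seeded, shuffled split. The function name and the 10% default are assumptions. For classification data, a stratified split that preserves class balance would be a better fit.

```python
import random
from typing import List, Sequence, Tuple


def split_train_eval(records: Sequence, eval_fraction: float = 0.1,
                     seed: int = 42) -> Tuple[List, List]:
    """Shuffle with a fixed seed, then hold out `eval_fraction` of the records for evaluation."""
    shuffled = list(records)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # fixed seed keeps the split reproducible
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


train, held_out = split_train_eval(range(100), eval_fraction=0.1)
print(len(train), len(held_out))
```

Deduplicate before splitting. Otherwise the same example can land in both sets and make the eval score look better than it is.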