Fine-Tuning LLMs | LegacyForward.ai

When to Fine-Tune vs Alternatives

Fine-tuning is expensive and slow compared to prompting or RAG -- but when you genuinely need to change model behavior, style, or domain expertise, nothing else works. Use this table to avoid fine-tuning when a simpler approach will do.

Approach	Best When	Cost	Effort	Latency Impact
Prompt engineering	Task is well-defined, few formats	Very low	Low	None
Few-shot examples	Need consistent format/style	Low	Low	Slightly higher (more tokens)
RAG	Need up-to-date or private knowledge	Medium	Medium	Higher (retrieval step)
Fine-tuning	Need behavior change, style, or domain expertise	High	High	Lower at inference
Pre-training (continued)	Entirely new domain/language	Very high	Very high	None

Decision Tree

Can prompt engineering solve it?
  YES -> Stop, use prompts
  NO  -> Is the issue missing knowledge?
           YES -> Use RAG
           NO  -> Is it about style, format, or behavior?
                    YES -> Fine-tune
                    NO  -> Is the base model too slow/expensive?
                             YES -> Fine-tune smaller model
                             NO  -> Revisit prompt engineering

Data Preparation

Data quality is the single biggest predictor of fine-tuning success -- a model trained on 200 clean, diverse examples will outperform one trained on 2,000 noisy ones. Get the format and quality right before you spend a dollar on compute.

JSONL Format (OpenAI Chat)

{"messages": [{"role": "system", "content": "You are a legal assistant."}, {"role": "user", "content": "Summarize this clause..."}, {"role": "assistant", "content": "This clause states that..."}]}
{"messages": [{"role": "system", "content": "You are a legal assistant."}, {"role": "user", "content": "Is this enforceable?"}, {"role": "assistant", "content": "Based on the terms..."}]}

JSONL Format (Anthropic)

{"messages": [{"role": "user", "content": "Summarize this clause..."}, {"role": "assistant", "content": "This clause states that..."}]}

JSONL Format (Alpaca-style, for open models)

{"instruction": "Summarize the following clause", "input": "The party of the first part...", "output": "This clause establishes that..."}

Data Quality Checklist

Check	Why	How
Minimum 50-100 examples	Models need enough signal	Count rows
Consistent format	Reduces noise	Validate JSON schema
Diverse inputs	Prevents overfitting	Cluster by topic, check distribution
Clean outputs	Garbage in = garbage out	Human review sample
No contradictions	Confuses the model	Cross-check examples
Balanced classes	Prevents bias toward majority	Check class distribution
Deduplication	Overfitting on duplicates	Hash-based dedup
PII removed	Privacy compliance	Run PII scanner

Recommended Dataset Sizes

Task Type	Minimum	Good	Excellent
Classification	50	500	5,000+
Structured extraction	100	1,000	10,000+
Style/tone transfer	200	1,000	5,000+
Domain expertise	500	5,000	50,000+
Code generation	1,000	10,000	100,000+

Fine-Tuning Methods

Full fine-tuning is rarely necessary -- LoRA and QLoRA achieve 90-95% of full fine-tuning quality at a fraction of the memory and cost. Choose your method based on your GPU budget and quality requirements.

Method	Parameters Trained	Memory Required	Training Speed	Quality
Full fine-tuning	All	Very high (4x model)	Slow	Highest
LoRA	0.1-1% of params	Low (model + adapters)	Fast	High
QLoRA	0.1-1% of params	Very low (4-bit base)	Fast	High
Prefix tuning	Prefix embeddings only	Very low	Very fast	Medium
Prompt tuning	Soft prompt tokens	Minimal	Fastest	Lower
RLHF	Reward model + policy	Very high	Very slow	Highest for alignment
DPO	Policy only (no reward model)	High	Medium	High for alignment

LoRA Key Concepts

Original weight matrix: W (d x d)
LoRA decomposition:    W' = W + BA
  where B: (d x r), A: (r x d), r << d

- r (rank): Typically 4-64. Higher = more capacity, more memory
- alpha: Scaling factor, typically 2x rank
- Target modules: q_proj, v_proj (minimum), k_proj, o_proj, up_proj, down_proj (full)

LoRA Configuration Example

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # scaling (usually 2*r)
    target_modules=[               # which layers to adapt
        "q_proj", "v_proj",
        "k_proj", "o_proj",
        "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable: ~0.5% of total parameters

QLoRA Setup

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# Then apply LoRA on top of 4-bit model

Training Parameters

Wrong hyperparameters waste compute and produce bad models. Learning rate and epoch count are the two most impactful settings -- get those right first, then tune the rest.

Parameter	Typical Range	Notes
Learning rate	1e-5 to 5e-4	Lower for larger models
Batch size	4-32	Limited by GPU memory; use gradient accumulation
Epochs	1-5	More data = fewer epochs needed
Warmup steps	5-10% of total steps	Stabilizes early training
Weight decay	0.01-0.1	Regularization
Max sequence length	512-4096	Match your data; padding wastes compute
Gradient accumulation	2-16	Simulates larger batch size
LR scheduler	Cosine or linear decay	Cosine usually better

Training with Hugging Face

from transformers import TrainingArguments, Trainer
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    args=training_args,
    max_seq_length=2048,
)
trainer.train()

GPU Memory Estimation

Running out of GPU memory mid-training is the most common fine-tuning blocker. Use this table to right-size your hardware before you start, and pick QLoRA when your GPU budget is tight.

Model Size	Full FT	LoRA (r=16)	QLoRA (4-bit)
1B	~8 GB	~5 GB	~3 GB
7B	~56 GB	~18 GB	~8 GB
13B	~104 GB	~32 GB	~14 GB
70B	~560 GB	~160 GB	~48 GB

Rule of thumb: Full fine-tuning needs ~4x model size in memory (model + gradients + optimizer states).

OpenAI Fine-Tuning (API)

If you do not need to self-host, API-based fine-tuning is the fastest path to a custom model. You upload data, kick off a job, and get a model endpoint back -- no GPU management required.

# 1. Upload training file
file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune"
)

# 2. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": "auto",
        "learning_rate_multiplier": "auto"
    }
)

# 3. Check status
client.fine_tuning.jobs.retrieve(job.id)

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:org:name:id",
    messages=[...]
)

Evaluation

A fine-tuned model that looks great on training loss can be catastrophically overfit. Always evaluate on held-out data using task-specific metrics, and compare against the base model to confirm the fine-tuning actually helped.

Metric	What It Measures	When to Use
Loss (train/val)	Model fitting	Always (check for overfitting)
Exact match	Perfect output match	Classification, extraction
BLEU / ROUGE	Text overlap	Summarization, translation
BERTScore	Semantic similarity	Open-ended generation
Human eval (side-by-side)	Overall quality	Before production deployment
Task-specific accuracy	Domain performance	Always

Overfitting Signals

Signal	What It Looks Like	Fix
Train loss drops, val loss rises	Classic overfit curve	Reduce epochs, add data
Val loss flat, train loss dropping	Memorization starting	Early stopping
Perfect train accuracy	Model memorized data	More data, regularization
Worse than base model on general tasks	Catastrophic forgetting	Lower LR, fewer epochs, LoRA

Cost Comparison

Fine-tuning costs vary by orders of magnitude depending on whether you use an API provider or self-host. Factor in both training cost and ongoing inference cost when making your decision.

Provider	Model	Training Cost	Inference Cost
OpenAI	GPT-4o-mini FT	$3.00/1M train tokens	$0.60/$2.40 per 1M in/out
OpenAI	GPT-4o FT	$25.00/1M train tokens	$3.75/$15.00 per 1M in/out
Together.ai	Llama 8B	~$0.50/hr (1x A100)	$0.20/1M tokens
Self-hosted	Any open model	GPU cost only	GPU cost only

Common Pitfalls

The most expensive mistake is fine-tuning when you should have used RAG, and the second most expensive is overfitting because you skipped the eval set. This list covers both and everything in between.

Pitfall	Problem	Fix
Fine-tuning for knowledge	LLMs forget; hallucination risk	Use RAG for knowledge, fine-tune for behavior
Too few examples	Underfitting, poor generalization	Start with 200+ high-quality examples
Too many epochs	Overfitting	Monitor val loss, use early stopping
Inconsistent data format	Confuses the model	Standardize all examples
No eval set	Can't detect overfitting	Hold out 10-20% of data
Ignoring base model capabilities	Redundant training	Test base model + prompt first
Not merging LoRA for production	Extra inference overhead	Merge adapters into base weights
Training on generated data only	Model amplifies errors	Include human-validated examples