LLMs, SLMs & Multimodal Models | LegacyForward.ai

01. The LLM Landscape

Diagram 1

Model landscape positioned by capability (vertical) versus cost-per-token (horizontal). The dashed curve represents the value frontier — models on it deliver optimal capability for their price point.

When people say a model is "large," they are referring to its parameter count — the number of learned numerical weights that encode the model's knowledge and behavior. But parameter count is just the headline number. What actually determines a model's capability is the combination of parameters, training data quality, and compute used during training. A model with fewer parameters trained on better, more carefully curated data can outperform a larger model trained sloppily. This is why some of the most surprising results in recent years have been small models punching far above their weight.

The major providers break into two camps. On the closed-weight side: OpenAI (GPT-4o, o1, o3), Anthropic (Claude 3.5 and 3.7), Google DeepMind (Gemini Flash, Pro, Ultra), and others. These companies keep weights proprietary and give you access through paid APIs. On the open-weight side: Meta AI (Llama 3.x), Mistral AI, Alibaba (Qwen 2.5), Microsoft (Phi-3, Phi-4), Google (Gemma 2), and DeepSeek. These labs release model weights publicly — you can download them, run them locally, fine-tune them, and deploy on your own infrastructure.

The landscape changes fast. Models that were state-of-the-art in early 2024 were beaten by mid-2024. GPT-3.5-level capability, which cost meaningful money in 2022, can now run locally for free on a mid-range laptop. The specific model names you memorize today will be outdated within months, but the frameworks for evaluating and selecting models stay relevant.

Think of it like this: The model landscape is like the smartphone market in 2010. New flagships appear every few months, last year's premium becomes this year's budget option, and the real skill is not knowing which phone is best today — it is knowing how to evaluate which phone fits your needs.

What This Means for Practitioners

Closed-weight vs. open-weight is an engineering decision, not a philosophical one. Here is when to choose each:

Factor	Closed-Weight API	Open-Weight Self-Hosted
Best for	Rapid prototyping, complex reasoning, user-facing chat	Data-sensitive workloads, high-volume inference, custom fine-tuning
Data privacy	Data leaves your premises	Data stays on your infrastructure
Cost model	Pay per token (scales linearly)	Pay for hardware (fixed cost, unlimited tokens)
Maintenance	Zero — provider handles everything	You manage GPUs, updates, monitoring
Customization	Limited to prompt engineering	Full fine-tuning, quantization, custom serving
Availability	Subject to provider outages and pricing changes	You control uptime and versioning

Context windows have expanded dramatically but longer is not always better. Claude 3.7 Sonnet offers 200K tokens, Gemini 2.0 can handle 1M tokens. But research consistently shows a "lost in the middle" problem — information buried deep in a long context is recalled less reliably. Structured retrieval (RAG) often outperforms naive context stuffing even when the window is technically large enough.

Mixture-of-Experts (MoE) is how frontier models stay fast despite being huge. An MoE model has multiple feed-forward blocks ("experts") and a routing network that selects a small subset for each token. Mixtral 8x7B has 46B total parameters but only activates about 12B per inference step — quality closer to a 40B+ dense model at the speed of a 12B model. GPT-4 is widely speculated to use this approach.

Benchmarks are useful for rough comparisons but unreliable for your specific task. MMLU measures factual recall, not the reasoning or instruction-following that matters in production. The benchmarks that matter most: LMSYS Chatbot Arena (human preference voting — the gold standard), LiveBench (dynamically updated to prevent contamination), and GPQA (graduate-level questions that even PhD experts struggle with). Always build your own evaluation set for your specific use case.

02. Key Models — What You Need to Know

Rather than memorize every model's specifications, focus on understanding which model to reach for in which situation. Here is the practical landscape.

GPT-4o is OpenAI's flagship. Natively multimodal (text, images, audio in one API call), fast, and strong on most benchmarks. Your default choice when you need maximum capability from the OpenAI ecosystem.

Claude 3.5 and 3.7 Sonnet from Anthropic are the best balance of speed, intelligence, and cost in the Claude family. Particularly strong at following complex, nuanced instructions without going off-script — popular for production systems where reliability matters. Claude 3.7 Sonnet adds "extended thinking" for complex reasoning at the cost of higher latency.

Gemini 2.0 Flash and Pro from Google are notable for native multimodality from the ground up and extremely competitive pricing. Flash is among the most cost-effective capable models available — a default choice for high-volume applications.

Llama 3.x from Meta is the most important open-weight family. The 8B variant runs on consumer hardware. The 70B variant is genuinely competitive with frontier closed models on many tasks. All variants are Apache 2.0 licensed for commercial use.

Mistral and Mixtral push the envelope on open-weight efficiency. Common choice for self-hosted deployments, especially in European enterprises with data sovereignty requirements.

DeepSeek-R1 is an open-weight reasoning model competitive with OpenAI's o1 on math and coding benchmarks. The fact that this capability is now available as downloadable weights is a significant shift.

Think of it like this: Choosing an LLM is like choosing a vehicle. A sports car (GPT-4o, Claude Opus) is amazing for winding mountain roads but expensive to run daily. A reliable sedan (GPT-4o-mini, Gemini Flash) handles 90% of trips at a fraction of the cost. A pickup truck (self-hosted Llama 70B) costs more upfront but pays for itself if you are hauling loads every day.

What This Means for Practitioners: Model Comparison

Model	Provider	Strengths	Best For	Pricing (per 1M tokens in/out)
GPT-4o	OpenAI	Strong reasoning, multimodal, fast	Complex tasks, user-facing chat	$2.50 / $10.00
GPT-4o-mini	OpenAI	Very cheap, surprisingly capable	High-volume simple tasks	$0.15 / $0.60
Claude 3.7 Sonnet	Anthropic	Instruction following, extended thinking, 200K context	Production systems, complex prompts	$3.00 / $15.00
Gemini 2.0 Flash	Google	Cheapest capable model, 1M context	Cost-sensitive pipelines	$0.075 / $0.30
o1 / o3	OpenAI	Multi-step reasoning, math, code	Genuinely hard reasoning problems	3-10x GPT-4o cost
Llama 3.1 70B	Meta (open)	Self-hostable, fine-tunable, competitive quality	Data-sensitive, high-volume, custom models	Hardware cost only
Phi-4 (14B)	Microsoft (open)	Small but strong reasoning, runs on consumer GPU	Edge, offline, budget-constrained	Hardware cost only

Practical default choices for this course: Claude Sonnet or GPT-4o for complex tasks. Gemini Flash or GPT-4o-mini for cost-sensitive pipelines. Llama 3 8B/70B for self-hosted experiments. This covers 95% of real-world use cases.

03. Small Language Models (SLMs)

The excitement around GPT-4-scale models can create a false impression that bigger is always better. In many production scenarios, the opposite is true. A small language model — typically under 10 billion parameters — can run on a single consumer laptop, respond in near-real-time, operate entirely offline, and cost essentially nothing per query once you own the hardware.

Microsoft's Phi series demonstrates what carefully chosen training data can accomplish. Phi-3-mini at 3.8B parameters fits in 4GB of VRAM, has a 128K-token context window, and matches or beats Llama 3 8B on many reasoning benchmarks. Phi-4 at 14B scores higher than Llama 3 70B on several math benchmarks while being five times smaller.

Gemma 2 from Google at 2B, 9B, and 27B parameters is specifically optimized for on-device inference. The 2B variant can run on a Raspberry Pi or a modern smartphone.

Quantization makes small models even smaller. Representing weights in 4-bit precision instead of 16-bit reduces memory by 4x. A 7B model goes from 14GB (needs a $10K GPU) to ~3.5GB (runs on a laptop). Quality penalty is typically 1-3% for int8, 3-7% for int4 — often acceptable for production tasks. The dominant format is GGUF, used by llama.cpp and Ollama.

What This Means for Practitioners: When to Choose an SLM

Scenario	Why SLM Wins	Recommended Models
Privacy-critical data (medical, legal, financial)	Data never leaves your premises	Llama 3 8B, Phi-4, Mistral 7B
Edge / offline deployment (mobile, IoT, field tools)	No internet required	Phi-3-mini, Gemma 2B
High-volume simple tasks (10M+ classifications/day)	Near-zero marginal cost	Phi-3-mini quantized, Gemma 9B
Development and testing	No API costs, instant iteration	Any model via Ollama
Custom fine-tuned specialist	Full control, one base + many adapters	Llama 3 8B + LoRA

The unit economics are compelling. A frontier API at $3/1M tokens processing 1M daily queries at 500 tokens average costs ~$547K/year. A 7B model on a single GPU at $2.50/hour costs ~$22K/year — a 25x reduction. The tradeoff is quality (which you must measure on your specific task) and operational complexity.

Ollama is the fastest way to run local models. A single command — ollama run llama3.2 — downloads, quantizes, and serves a model. It also exposes an OpenAI-compatible API on localhost, meaning any code written for the OpenAI SDK works unchanged against a local model.

from openai import OpenAI

# Redirect to local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Explain quantization in two sentences."}
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)

Think of it like this: An SLM is like a specialist surgeon vs. a general hospital. The hospital (GPT-4o) can handle anything, but for the one surgery you do 10,000 times a day, the specialist (a fine-tuned 7B model) is faster, cheaper, and just as good at that specific procedure.

04. Multimodal Models

A multimodal model processes multiple types of data — most commonly text and images, but increasingly audio and video. This is not a minor feature addition; it eliminates entire categories of preprocessing pipelines.

The most immediately practical capability is vision: send an image alongside a text prompt and have the model reason about both. Screenshot of an error message? Ask the model to debug it. Photo of a whiteboard? Ask for an explanation. PDF page with tables and charts? Send the page as an image and extract structured data directly — no brittle OCR pipeline needed.

Document understanding is where multimodal models change the game. Traditional PDF extraction meant: convert to text (lossy), run OCR on scans (error-prone), then process with NLP. With multimodal models, you send a page screenshot directly — the model reads text, understands layout, interprets figures, and reasons about everything together.

Audio capabilities add another dimension. OpenAI's Whisper handles speech-to-text with near-human accuracy. GPT-4o's real-time audio API processes speech directly — preserving tone and emotion rather than just transcribing words.

Diagram 2

Multimodal input processing: images are split into 16x16 patches, encoded by a Vision Transformer, and projected into the LLM's embedding space. Text is tokenized and embedded separately. Both are concatenated into a single unified sequence before entering the language model decoder.

What This Means for Practitioners: Multimodal Use Cases

Use Case	Best Models	Notes
Document extraction (tables, forms)	Claude 3+, GPT-4o	Send page screenshots, not extracted text
Code screenshot debugging	GPT-4o, Claude 3+	Paste error screenshots directly
Chart and diagram analysis	Gemini 1.5+, Claude 3+	Model reads visual data natively
Video understanding	Gemini 2.0 (native video)	Matches audio with visual events at timestamps
Speech-to-text preprocessing	Whisper (open-source)	Runs locally, 5GB VRAM for medium model
Voice assistants	GPT-4o real-time audio API	Processes speech directly, not transcribe-then-LLM

Sending images via API requires encoding as base64 and including in the message content as a typed content block:

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

image_data = base64.standard_b64encode(Path("chart.png").read_bytes()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Extract all data from this chart as JSON."}
        ],
    }],
)
print(message.content[0].text)

Image tokens cost money. A 512x512 image consumes approximately 170 tokens in OpenAI's pricing model. Factor this into your cost estimates for vision-heavy pipelines.

05. Model Selection Framework

One of the most common mistakes is optimizing for the "best" model on general benchmarks. Benchmark performance may have no correlation with performance on your specific task. A model that scores 5 points lower on MMLU might outperform the leader on your use case because it was fine-tuned on similar data or follows your prompt patterns better.

The "good enough" principle is one of the most commercially important ideas in applied AI. A model that achieves 90% of the quality at 20% of the cost and 300% of the speed is almost always the right production choice — provided you can measure that quality gap.

What This Means for Practitioners: The Decision Framework

Start with the latency requirement:

Latency Need	Model Tier	Examples
Under 500ms (interactive UI)	Fast/cheap	GPT-4o-mini, Gemini Flash, Claude Haiku, local SLM
1-5 seconds (standard chat)	Balanced	Claude Sonnet, GPT-4o, Gemini Pro
5-30 seconds (complex reasoning)	Premium reasoning	o1/o3, Claude extended thinking
Non-real-time (batch processing)	Cheapest that meets quality	Batch API at 50% discount, self-hosted

Then check these constraints:

Vision needed? GPT-4o, Claude 3+, Gemini 1.5+, or Llama 3.2 (open-weight)
Private data? Self-host: Llama 3, Mistral, Phi-4, Qwen via Ollama or vLLM
Multilingual? Qwen 2.5 (strong CJK), Claude (strong European), GPT-4o (broad)
Budget-constrained high volume? Gemini Flash ($0.075/1M in) or self-hosted SLM

The professional testing methodology:

Define your task and collect 50-200 representative examples
Label them with correct outputs (human-annotated or from a trusted model)
Run each candidate model with your production prompt
Score outputs against golden labels
Sort by quality-per-dollar on your specific task
Pick the model at the knee of the curve

import tiktoken

def estimate_cost(prompt: str, model: str = "gpt-4o") -> dict:
    PRICING = {
        "gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60),
        "claude-sonnet-4-6": (3.00, 15.00), "gemini-2.0-flash": (0.075, 0.30),
    }
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    input_tokens = len(encoding.encode(prompt))
    price_in, price_out = PRICING.get(model, (3.0, 15.0))
    estimated_output = 300
    return {
        "input_tokens": input_tokens,
        "est_cost_usd": round((input_tokens / 1_000_000) * price_in + (estimated_output / 1_000_000) * price_out, 6),
    }

Important Note on Pricing: API prices change frequently. Always check the current pricing page for each provider before building cost estimates into business cases.

Interview Ready

Elevator Pitch — 2-Minute Interview Explanation: "The modern model landscape spans three categories. Large Language Models like GPT-4o and Claude Sonnet deliver frontier reasoning and multimodal capabilities through paid APIs. Small Language Models like Phi-4 and Gemma 2 can run on a single consumer GPU and are the right choice for privacy, offline access, or high-volume inference at near-zero marginal cost. Multimodal models extend beyond text to accept images, audio, and video — using Vision Transformers to encode image patches into the same vector space as text tokens. The key engineering decision is not picking the 'best' model — it is building an evaluation set for your specific task and finding the model at the knee of the quality-per-dollar curve."

Interview Questions

Question	What They're Really Asking
When would you choose an SLM over a frontier API?	Can you make practical cost-quality tradeoffs?
How does a multimodal model process an image alongside text?	Do you understand ViT + projection, or is it magic?
What is Mixture-of-Experts and why does it matter?	Can you explain how MoE decouples parameters from inference cost?
How would you decide between closed-weight API and self-hosted open-weight?	Do you think about privacy, cost, latency, and operations holistically?
How would you select a model for a new production use case?	Do you have a systematic methodology or just pick the most famous model?

Model Answers

Q1 — SLM vs. frontier: SLMs win when data cannot leave your premises, when you need offline/edge deployment, or when running millions of queries daily on a well-defined task. The decision framework: establish a quality ceiling with a frontier model, build a golden eval set, then test smaller alternatives to find where quality drops below your threshold. A quantized 7B model can replace a $500K/year API bill if the task is bounded. Q2 — Multimodal image processing: The image is divided into 16x16 pixel patches. Each patch is encoded by a Vision Transformer into a visual embedding. A projection layer maps these embeddings into the same dimensional space as text token embeddings. Both are concatenated into a single sequence processed by the language model decoder — image patches become "visual tokens." Q3 — Mixture-of-Experts: In a dense transformer, every token passes through the same feed-forward network. In MoE, multiple expert FFN blocks exist and a routing network selects a small subset (typically 2) per token. Mixtral 8x7B has 46B total parameters but activates only ~12B per step — quality of a 40B+ dense model at the speed of a 12B model. Q4 — Closed vs. open-weight: Closed APIs offer frontier quality, zero infrastructure, and continuous updates — but data leaves your premises and costs scale linearly with volume. Open-weight models keep data on-premises, allow fine-tuning, and offer near-zero marginal cost at high volume — but require GPU infrastructure and operational overhead. The decision depends on data sensitivity, volume (at 10M+ daily queries self-hosting saves 25x), and the quality gap on your specific task. Q5 — Model selection methodology: Define the task, collect 50-200 representative examples, label them, run all candidate models, score against golden labels, and sort by quality-per-dollar. Pick the model at the knee of the curve — the one after which additional quality gains cost disproportionately more.

Common Mistakes

Defaulting to the largest model for every task. GPT-4o for simple classification wastes money and adds latency. Measure quality on your specific task — GPT-4o-mini or a fine-tuned 7B model is often sufficient.
Ignoring parameter count when planning infrastructure. A 70B model at FP16 requires ~140GB VRAM — two A100 GPUs minimum. Proposals to "just deploy Llama 70B" without hardware planning reveal a gap between theory and production readiness.
Treating multimodal as OCR + LLM. Native multimodal models process layout, figures, and text together. A pipeline of "OCR the document, then send text to an LLM" misses the architectural shift — direct image input handles tables and charts far more reliably.

Previous Module

01 · Foundations of GenAI

Next Module

03 · APIs for LLMs

Phase: Foundations → API Access