Transformer Foundations

Quick reference for transformer architecture, attention mechanism, positional encoding, tokenization, model sizes, and inference optimization.

Transformer Architecture

The transformer is the architecture behind nearly every modern LLM. Understanding its building blocks (embeddings, attention, and feed-forward layers) is essential for making informed decisions about model selection, fine-tuning, and inference optimization.

Input Tokens
     │
     ▼
┌──────────────┐
│  Embedding   │  Token embedding + Positional encoding
│  Layer       │  (d_model dimensions)
└──────┬───────┘
       │
       ▼ (repeated N times)
┌──────────────────────────────┐
│  Transformer Block           │
│  ┌────────────────────────┐  │
│  │ Multi-Head Attention   │  │
│  │ (self-attention)       │  │
│  └───────────┬────────────┘  │
│         Add & LayerNorm      │
│  ┌───────────▼────────────┐  │
│  │ Feed-Forward Network   │  │
│  │ (FFN: d_model -> 4*d   │  │
│  │  -> d_model)           │  │
│  └───────────┬────────────┘  │
│         Add & LayerNorm      │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│  Output Head                 │
│  Linear (d_model -> vocab)   │
│  Softmax                     │
└──────────────────────────────┘

Key Dimensions

| Symbol | Name | Typical Values | Description |
| --- | --- | --- | --- |
| d_model | Model dimension | 768, 1024, 4096, 8192 | Embedding/hidden size |
| d_ff | Feed-forward dimension | 4 * d_model | Inner FFN width |
| n_heads | Attention heads | 12, 16, 32, 64, 128 | Parallel attention heads |
| d_head | Head dimension | d_model / n_heads | Per-head dimension |
| n_layers | Transformer blocks | 12, 24, 32, 80, 128 | Depth of model |
| vocab_size | Vocabulary | 32K, 50K, 128K, 256K | Tokenizer vocabulary |
| context_length | Max sequence | 2K, 8K, 32K, 128K, 1M | Input window |

Self-Attention Mechanism

Self-attention is the core innovation that makes transformers work: it lets every token attend directly to every other token, so the model can capture long-range dependencies that RNNs and CNNs struggle with. Understanding it is key to grasping why models behave the way they do.

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Step-by-step:

  1. Project input into Query (Q), Key (K), Value (V) matrices
  2. Compute attention scores: QK^T (dot product of queries and keys)
  3. Scale by 1/sqrt(d_k) (prevents softmax saturation)
  4. Apply mask (causal mask for decoder, padding mask)
  5. Apply softmax (normalize to probabilities)
  6. Multiply by Values (V) to get weighted output
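
The steps above can be traced in a few lines of plain Python. This is a minimal single-head sketch that assumes the Q, K, V projections of step 1 have already been applied; the function names and the toy inputs are illustrative, not part of any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of seq_len vectors (steps 2-6 of the list above).
    """
    d_k = len(K[0])
    output, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))             # step 4: hide future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))   # step 2: query-key dot product
                scores.append(dot / math.sqrt(d_k))      # step 3: scale
        w = softmax(scores)                              # step 5: normalize
        weights.append(w)
        # step 6: weighted sum of the value vectors
        output.append([sum(w[j] * V[j][c] for j in range(len(V)))
                       for c in range(len(V[0]))])
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = causal_attention(Q, K, V)
print(attn[0])  # the first token can only attend to itself
```

Real implementations do the same arithmetic as batched matrix multiplications on a GPU; the nested loops here only make each step visible.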

Multi-Head Attention

# Conceptual implementation (PyTorch)
import torch

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project the input; each result is (batch, seq_len, d_model)
    Q = x @ W_q
    K = x @ W_k
    V = x @ W_v

    # Reshape to (batch, n_heads, seq_len, d_head)
    Q = Q.reshape(batch, seq_len, n_heads, d_head).transpose(1, 2)
    K = K.reshape(batch, seq_len, n_heads, d_head).transpose(1, 2)
    V = V.reshape(batch, seq_len, n_heads, d_head).transpose(1, 2)

    # Scaled dot-product attention per head
    scores = (Q @ K.transpose(-2, -1)) / (d_head ** 0.5)

    # Causal mask: True above the diagonal marks future positions to hide
    causal_mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
    )
    scores = scores.masked_fill(causal_mask, float('-inf'))
    attn = torch.softmax(scores, dim=-1)
    output = attn @ V  # (batch, n_heads, seq_len, d_head)

    # Concatenate heads and project
    output = output.transpose(1, 2).reshape(batch, seq_len, d_model)
    return output @ W_o

Attention Variants

| Variant | Complexity | Description | Used In |
| --- | --- | --- | --- |
| Full attention | O(n^2) | Standard self-attention | GPT, BERT, most models |
| Multi-Query (MQA) | O(n^2), less memory | One K,V head shared across all query heads | PaLM, Falcon |
| Grouped-Query (GQA) | O(n^2), less memory | K,V shared within groups of query heads | Llama 2 70B, Llama 3, Gemma 2 |
| Flash Attention | O(n^2) compute, O(n) memory | Memory-efficient exact attention | Most modern training |
| Sparse attention | O(n * sqrt(n)) down to O(n) | Attend to subset of positions | Sparse Transformer, Longformer, BigBird |
| Linear attention | O(n) | Approximate, kernel-based | Performer, RWKV-like models |
| Sliding window | O(n * w) | Fixed local window (Longformer adds global tokens) | Mistral, Longformer |
| Ring attention | O(n^2) distributed | Distributed across devices | Very long sequences |
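
To make the MQA and GQA rows concrete, the sketch below shows how query heads can be mapped onto a smaller set of K/V heads and how much smaller the K/V tensors become. The helper names are hypothetical, and the contiguous grouping (heads 0-3 share KV head 0, and so on) is an assumed convention rather than something the table specifies.

```python
def kv_head_for_query_head(q_head, n_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads
    return q_head // group_size

def kv_reduction(n_heads, n_kv_heads):
    """How many times smaller K/V are than with one K/V head per query head."""
    return n_heads / n_kv_heads

# 32 query heads: full attention keeps 32 KV heads, GQA here keeps 8, MQA keeps 1
for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(h, 32, n_kv) for h in range(32)]
    print(name, mapping[:8], f"{kv_reduction(32, n_kv):.0f}x smaller K/V")
```

The attention arithmetic itself is unchanged, which is why the complexity stays O(n^2); only the number of stored K and V heads shrinks.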

Positional Encoding

Transformers have no inherent sense of token order; positional encoding is what gives them sequence awareness. The choice of encoding method strongly affects how well a model handles long contexts and extrapolates beyond its training length.

| Method | Type | Key Property | Used In |
| --- | --- | --- | --- |
| Sinusoidal | Absolute, fixed | No training needed | Original Transformer |
| Learned absolute | Absolute, trained | Limited to training length | GPT-2, BERT |
| RoPE (Rotary) | Relative, applied to Q/K | Efficient; needs scaling to go far beyond training length | Llama, Qwen, Gemma |
| ALiBi | Relative, bias on attention | Good length extrapolation | MPT, BLOOM |
| YaRN | RoPE extension | Extends RoPE context length | Long-context Llama |
| NTK-aware scaling | RoPE extension | Interpolation for longer context | Many fine-tunes |

Rotary Position Embedding (RoPE)

Core idea: Rotate Q and K vectors based on position
- Position m, dimension pair (2i, 2i+1), for i = 0 .. d/2 - 1:
  RoPE(x, m) rotates the pair by angle m * theta_i
  where theta_i = 1 / 10000^(2i/d)

Key property: dot product of rotated Q and K depends
on relative position (m - n), not absolute positions.
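
The relative-position property can be checked numerically. The sketch below is a plain-Python illustration that assumes the common convention of rotating adjacent pairs (x[2i], x[2i+1]) with theta_i = 10000^(-2i/d); the function names and test vectors are made up for the example.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (m - n = 3) at two different absolute positions
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far, abs(near - far) < 1e-9)
```

Because each pair is only rotated, vector norms are preserved, and shifting both positions by the same amount leaves the attention score unchanged up to floating-point error.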

Tokenization

Tokenization determines how text is split into the atomic units the model processes. A bad tokenizer wastes tokens on common words and butchers rare ones -- understanding your tokenizer's behavior is essential for accurate cost estimation and debugging unexpected model outputs.

Methods Comparison

| Method | Algorithm | Vocab Size | Used By |
| --- | --- | --- | --- |
| BPE (Byte-Pair Encoding) | Merge most frequent pairs | 32K-128K | GPT, Llama |
| WordPiece | Merge the pair that most increases likelihood | 30K-50K | BERT |
| SentencePiece (BPE/Unigram) | Language-agnostic BPE or Unigram | 32K-256K | T5, Gemma, Llama 1/2 |
| Unigram | Prune least useful tokens from a large initial vocabulary | 32K-128K | T5, mBART |
| Byte-level BPE | BPE on raw bytes | 50K-100K | GPT-2, GPT-4 |
| tiktoken | Optimized BPE implementation | 100K-200K | OpenAI models |
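
As an illustration of the "merge most frequent pairs" idea behind BPE, here is a toy trainer in plain Python. It is a sketch only: it works on characters rather than bytes, has no pre-tokenization rules, and the function name and corpus are invented for the example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, no byte fallback)."""
    # Each word starts as a tuple of single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest", "slow"], num_merges=2)
print(merges)
print(dict(corpus))
```

After two merges the frequent substring "low" has become a single symbol, while rarer endings such as "est" are still split into characters. That is the behavior the vocabulary-size trade-off in the table is about.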

Token Estimation

| Content | Approx. Tokens per Word | Notes |
| --- | --- | --- |
| English prose | ~1.3 | Average |
| Code | ~1.5-2.0 | More tokens for syntax |
| JSON/structured | ~2.0-3.0 | Lots of special chars |
| Non-English | ~1.5-3.0 | Depends on language |
| Numbers | ~1 per 1-3 digits | Varies by tokenizer |

# Count tokens (OpenAI)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens))  # 4

# Count tokens (Hugging Face); this repository is gated and requires accepting its license
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokens = tok.encode("Hello, world!")  # may include special tokens such as BOS
print(len(tokens))  # varies by tokenizer
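
When no tokenizer is at hand, the ratios in the table above can serve as a rough estimator. The sketch below uses table midpoints as assumed ratios; the function names and the price in the usage line are hypothetical, and a real tokenizer count should replace this whenever accuracy matters.

```python
# Rough tokens-per-word ratios (midpoints of the ranges in the table above)
TOKENS_PER_WORD = {
    "english": 1.3,
    "code": 1.75,
    "json": 2.5,
}

def estimate_tokens(text, content_type="english"):
    """Estimate the token count from a whitespace word count (heuristic only)."""
    words = len(text.split())
    return round(words * TOKENS_PER_WORD[content_type])

def estimate_cost(tokens, usd_per_million_tokens):
    """Cost in USD for a given token count at a given (assumed) price."""
    return tokens / 1_000_000 * usd_per_million_tokens

prompt = "the quick brown fox jumps over the lazy dog"
n = estimate_tokens(prompt)
print(n, "tokens (estimated)")
print(estimate_cost(n, usd_per_million_tokens=2.0), "USD at a hypothetical $2 per 1M tokens")
```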

Model Sizes

Knowing the architectural details of popular models helps you estimate memory requirements, choose the right hardware, and understand trade-offs between model families. These numbers change fast -- but the relationships between parameters, layers, and memory remain consistent.

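| Model | Parameters | Layers | d_model | Heads | Context | GQA |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8B | 32 | 4096 | 32 | 128K | Yes (8 KV) |
| Llama 3.1 70B | 70B | 80 | 8192 | 64 | 128K | Yes (8 KV) |
| Llama 3.1 405B | 405B | 126 | 16384 | 128 | 128K | Yes (8 KV) |
| Mistral 7B | 7B | 32 | 4096 | 32 | 32K | Yes |
| Gemma 2 9B | 9B | 42 | 3584 | 16 | 8K | Yes |
| Gemma 2 27B | 27B | 46 | 4608 | 32 | 8K | Yes |
| Qwen 2.5 72B | 72B | 80 | 8192 | 64 | 128K | Yes |
| DeepSeek V3 | 671B (37B active) | 61 | 7168 | 128 | 128K | MLA |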

Parameter Estimation

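Parameters (approx) =
  Embedding:     vocab_size * d_model
  Per layer:     12 * d_model^2   (attention: 4*d^2, FFN: 8*d^2)
  Total:         vocab * d + n_layers * 12 * d^2

  Assumes standard multi-head attention and a two-matrix FFN with d_ff = 4 * d_model;
  GQA and gated FFNs (e.g. SwiGLU) change the per-layer constant.

Memory (inference, float16):
  ~2 bytes per parameter (weights only; KV cache and activations are extra)
  7B model -> ~14 GB
  70B model -> ~140 GB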
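
The estimate above is easy to turn into a helper. The configuration in the usage lines is a hypothetical 7B-class model (vocab 32,000, d_model 4096, 32 layers), chosen only to show that the formula lands in the right range; it ignores biases, norms, and any separate output head.

```python
def estimate_params(vocab_size, d_model, n_layers):
    """Approximate parameter count: embeddings plus 12 * d_model^2 per layer."""
    embedding = vocab_size * d_model
    per_layer = 12 * d_model ** 2  # attention 4*d^2 + FFN 8*d^2
    return embedding + n_layers * per_layer

# Hypothetical 7B-class configuration
params = estimate_params(vocab_size=32_000, d_model=4096, n_layers=32)
print(f"{params / 1e9:.2f}B parameters")
print(f"{params * 2 / 1e9:.1f} GB of weights in float16")
```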

Memory Requirements

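| Model Size | FP32 | FP16/BF16 | INT8 | INT4 |
| --- | --- | --- | --- | --- |
| 1B | 4 GB | 2 GB | 1 GB | 0.5 GB |
| 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| 405B | 1.6 TB | 810 GB | 405 GB | 203 GB |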
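
Every cell in this table is parameters times bits divided by 8. The sketch below regenerates it under that assumption; it counts weights only, in decimal GB, and leaves out KV cache, activations, and the small per-group metadata that real quantization formats add.

```python
BITS = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(n_params, bits):
    """Weight storage in decimal GB for a given bit width per parameter."""
    return n_params * bits / 8 / 1e9

for size in (1e9, 7e9, 13e9, 70e9, 405e9):
    cells = ", ".join(f"{name}: {weight_memory_gb(size, bits):g} GB"
                      for name, bits in BITS.items())
    print(f"{size / 1e9:g}B -> {cells}")
```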

Inference Optimization

Inference optimization is where theory meets production cost. A 2x speedup from quantization or batching can roughly halve GPU bills, and for many applications it is the difference between a usable product and an unaffordably slow one.

KV Cache

Without KV cache: For each new token, recompute K and V for ALL previous tokens
With KV cache:    Store K, V tensors for previous tokens, only compute for new token

KV cache size per token = 2 * n_layers * n_kv_heads * d_head * precision_bytes
  For Llama 3.1 8B (FP16): 2 * 32 * 8 * 128 * 2 = 131,072 bytes (128 KiB) per token
  For 128K context: ~16 GiB of KV cache per sequence
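
The formula above as a small calculator, a sketch using the Llama 3.1 8B figures quoted in this section (32 layers, 8 KV heads, d_head 128, 2 bytes per value). The function name and the `batch` parameter are additions for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, precision_bytes=2):
    """Bytes needed to cache K and V for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * d_head * precision_bytes  # 2 = K and V
    return per_token * seq_len * batch

per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full = kv_cache_bytes(32, 8, 128, seq_len=128 * 1024)
print(per_token, "bytes per token")
print(full / 2**30, "GiB for one 128K-token sequence")
```

The cache grows linearly with both sequence length and batch size, so serving several long sequences at once multiplies this figure.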

Quantization

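| Method | Bits | Quality Loss | Speedup (approx.) | Memory Savings (vs FP32) |
| --- | --- | --- | --- | --- |
| FP32 (baseline) | 32 | None | 1x | 1x |
| FP16 / BF16 | 16 | Negligible | ~2x | 2x |
| INT8 (W8A8) | 8 | Very small | ~2-3x | 4x |
| INT4 (GPTQ, AWQ) | 4 | Small | ~3-4x | 8x |
| GGUF Q4_K_M | ~4.8 | Small | ~3x | ~6.5x |
| 2-bit | 2 | Noticeable | ~4-5x | 16x |
| 1-bit (BitNet, trained at low precision) | 1.58 | Moderate | ~5-6x | ~20x |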
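
To show where the quality loss in this table comes from, here is a minimal sketch of symmetric absmax INT8 quantization, one simple scheme among many. It is not how GPTQ, AWQ, or GGUF work internally (those use per-group scales and error-aware rounding); the function names and weights are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"scale={scale:.4f}, max round-trip error={max_err:.2e}")
```

The round-trip error is bounded by half the scale, so a single large outlier weight inflates the error for every other weight. That is the motivation for the per-group and activation-aware methods listed under Quantization Tools.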

Quantization Tools

| Tool | Methods | Framework |
| --- | --- | --- |
| GPTQ | Post-training 4/8-bit | HuggingFace, vLLM |
| AWQ | Activation-aware 4-bit | vLLM, TGI |
| llama.cpp (GGUF) | Multiple quant levels | llama.cpp, Ollama |
| bitsandbytes | NF4, INT8 | HuggingFace |
| SmoothQuant | W8A8 | TensorRT-LLM |
| FP8 | 8-bit floating point | TensorRT-LLM, vLLM |

Inference Serving Frameworks

| Framework | Key Feature | Best For |
| --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High-throughput serving |
| TensorRT-LLM | NVIDIA optimized, FP8 | Maximum GPU performance |
| TGI (HuggingFace) | Easy deployment, quantization | HuggingFace models |
| Ollama | Local, easy setup | Desktop/development |
| SGLang | Optimized scheduling, RadixAttention | Complex prompting |
| llama.cpp | CPU/Metal/CUDA, GGUF format | Edge, CPU inference |
| ExLlamaV2 | Fast GPTQ inference | Consumer GPUs |

Throughput Optimization Techniques

| Technique | Description | Impact |
| --- | --- | --- |
| Continuous batching | Process new requests as slots free | 2-5x throughput |
| PagedAttention (vLLM) | Non-contiguous KV cache memory | Better memory utilization |
| Speculative decoding | Draft model proposes, main model verifies | 2-3x speed |
| Prefix caching | Cache KV for shared system prompts | Saves redundant computation |
| Tensor parallelism | Split each layer's weight matrices across GPUs | Run larger models |
| Pipeline parallelism | Split consecutive layers across GPUs | Run deeper models |
| Flash Attention 2/3 | Fused attention kernel | 2-4x attention speedup |
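
Speculative decoding is the least obvious entry here, so a toy sketch may help. It uses greedy decoding only and two hypothetical stand-in "models" (simple arithmetic rules, not neural networks). It also omits the bonus token a real implementation emits when the whole draft is accepted. The point is only the accept-or-correct loop.

```python
def target_next(tokens):
    """Hypothetical 'large' model: a deterministic next-token rule for the demo."""
    return (tokens[-1] * 3 + len(tokens)) % 7

def draft_next(tokens):
    """Hypothetical 'small' model: agrees with the target unless the last token is 6."""
    guess = target_next(tokens)
    return (guess + 1) % 7 if tokens[-1] == 6 else guess

def greedy_decode(prompt, n_new):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively (cheap)
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_next(draft))
        proposed = draft[len(tokens):]
        # 2. The target model checks every proposed position. A real system does
        #    this in one batched forward pass, which is where the speedup comes from.
        target_passes += 1
        for tok in proposed:
            expected = target_next(tokens)
            tokens.append(expected)  # the target's token is always the one kept
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
    return tokens[:len(prompt) + n_new], target_passes

spec, passes = speculative_decode([1, 2], n_new=12)
baseline = greedy_decode([1, 2], n_new=12)
print(spec == baseline, "target passes:", passes, "baseline passes:", 12)
```

The output is identical to plain greedy decoding because a drafted token is only kept when the target model would have produced it anyway; the gain is fewer sequential passes through the large model.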

Key Formulas

These formulas let you estimate compute costs, memory requirements, and optimal training configurations from first principles. Bookmark them -- they come up every time you plan a training run or evaluate a new model.

Attention Complexity

Time complexity:   O(n^2 * d)   where n = sequence length, d = dimension
Memory complexity: O(n^2)       for the attention matrix
                   O(n)         extra with Flash Attention (the n x n matrix is never materialized)
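
A quick calculation shows why the O(n^2) memory term matters. This sketch assumes 32 heads and FP16 scores, and counts only the score matrix of a single layer for a single sequence; the function name is illustrative.

```python
def attention_scores_bytes(seq_len, n_heads, batch=1, precision_bytes=2):
    """Memory to materialize the full (seq_len x seq_len) score matrix for one layer."""
    return batch * n_heads * seq_len * seq_len * precision_bytes

for n in (2_048, 8_192, 32_768):
    gib = attention_scores_bytes(n, n_heads=32) / 2**30
    print(f"n={n:>6}: {gib:.2f} GiB per layer")
```

Quadrupling the sequence length multiplies this by sixteen, which is why kernels that avoid storing the full matrix make long contexts practical.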

FLOPs Estimation

Per token (forward pass):
  FLOPs ~ 2 * P  (where P = number of parameters)

Per token (forward + backward):
  FLOPs ~ 6 * P

Training total FLOPs:
  C = 6 * P * D  (where D = number of training tokens)

Chinchilla optimal:
  D ~ 20 * P  (train for 20 tokens per parameter)
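
These rules of thumb combine into a quick training-budget estimate. In the sketch below the 7B model size, the peak throughput of 1e15 FLOP/s per GPU, and the 40% utilization are all assumptions chosen for illustration, not measurements.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens under the D ~ 20 * P rule of thumb."""
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    """Total training compute, C = 6 * P * D."""
    return 6 * n_params * n_tokens

def gpu_days(total_flops, peak_flops_per_gpu, utilization):
    """GPU-days at a given peak rate and assumed utilization."""
    return total_flops / (peak_flops_per_gpu * utilization) / 86_400

P = 7e9
D = chinchilla_tokens(P)
C = training_flops(P, D)
print(f"D = {D:.2e} tokens, C = {C:.2e} FLOPs")
# Hypothetical hardware: 1e15 peak FLOP/s per GPU at 40% utilization
print(f"{gpu_days(C, 1e15, 0.4):.0f} GPU-days")
```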

Scaling Laws

Loss ~ (C / C_0) ^ (-alpha)

Where:
  C = compute budget (FLOPs)
  C_0 = fitted constant
  alpha ~ 0.05 (the compute exponent reported by Kaplan et al., 2020; the exact value depends on the setup)

Key insight: Performance improves as a power law of:
  - Model size (parameters)
  - Dataset size (tokens)
  - Compute budget (FLOPs)
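
A small exponent means large compute multipliers buy modest loss reductions. The sketch below evaluates the power law with alpha = 0.05 as given above; it ignores any irreducible loss floor, so treat the numbers as illustrative.

```python
def loss_ratio(compute_multiplier, alpha=0.05):
    """Factor by which loss changes when compute is multiplied, under Loss ~ C^(-alpha)."""
    return compute_multiplier ** (-alpha)

for mult in (2, 10, 100):
    print(f"{mult:>4}x compute -> loss x {loss_ratio(mult):.3f}")
```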

Common Pitfalls

These pitfalls cause the most confusion for practitioners working with transformers in production -- from OOM crashes to silent quality degradation from wrong quantization choices.

| Pitfall | Problem | Fix |
| --- | --- | --- |
| Context length > training length | Degraded quality | Use models trained on long context, or apply RoPE scaling |
| Wrong quantization for task | Quality loss on reasoning | Use higher precision (FP16/INT8) for complex tasks |
| KV cache OOM | Out of memory for long sequences | Use PagedAttention (vLLM), or reduce batch size |
| Ignoring tokenization | Unexpected token counts/splits | Always check tokenizer output for edge cases |
| FP16 overflow in training | NaN loss | Use BF16 (wider dynamic range) or FP16 mixed precision with loss scaling |
| Too many attention heads | Diminishing returns | Follow d_head = 64-128 convention |
| Batch size too small | Underutilized GPU | Raise the per-device batch size if memory allows; use gradient accumulation to increase the effective batch |
| Not using Flash Attention | Slow training, high memory | Enable FlashAttention-2 or 3 for modern GPUs |
| Generating token by token without batching | Low throughput | Use continuous batching with vLLM/TGI |