Quick Reference 02

RAG Systems

Quick reference for RAG pipeline design, chunking, embeddings, vector databases, and retrieval strategies.


RAG Pipeline Overview

RAG is often the most practical way to give an LLM access to private or current data without fine-tuning. Most production RAG systems follow the same three-phase pattern of ingestion, retrieval, and generation; get this pipeline right and the rest is optimization.

Document Ingestion          Retrieval                  Generation
┌──────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Load Docs    │     │ Embed Query      │     │ Build Prompt     │
│ Chunk Text   │────>│ Vector Search    │────>│ Context + Query  │
│ Embed Chunks │     │ Rerank Results   │     │ LLM Generation   │
│ Store in VDB │     │ Filter/Dedupe    │     │ Citation/Source  │
└──────────────┘     └──────────────────┘     └──────────────────┘

Chunking Strategies

Chunking is where most RAG pipelines silently fail -- chunks that are too small lose context, chunks that are too large dilute relevance. The right strategy depends entirely on your document type.

Strategy | Chunk Size | Overlap | Best For
Fixed-size | 256-512 tokens | 10-20% | General purpose, simple
Sentence-based | 3-5 sentences | 1 sentence | Narrative text
Paragraph-based | 1-3 paragraphs | 1 paragraph | Structured docs
Recursive character | 500-1000 chars | 100-200 chars | Mixed content
Semantic | Varies | Concept boundary | Research papers
Document-specific | By section/page | Headers as context | Technical docs
Agentic (LLM-based) | Varies | Determined by model | Complex/messy content
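
As an illustration of the sentence-based row, here is a minimal sketch of sentence chunking with overlap. The function name `sentence_chunks` and the regex-based sentence splitter are assumptions made for this example; real text with abbreviations or decimals would need a proper sentence tokenizer.

```python
import re

def sentence_chunks(text, sentences_per_chunk=4, overlap=1):
    """Group sentences into chunks that share `overlap` sentences with the previous chunk."""
    if overlap >= sentences_per_chunk:
        raise ValueError("overlap must be smaller than sentences_per_chunk")
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # this window already reached the end of the text
    return chunks

sample = "One. Two. Three. Four. Five. Six."
for chunk in sentence_chunks(sample, sentences_per_chunk=3, overlap=1):
    print(chunk)
```

Each chunk repeats the last sentence of the previous one, which is the "1 sentence" overlap from the table.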

Chunk Size Guidelines

Content Type | Recommended Size | Reasoning
FAQ / Q&A | 100-200 tokens | Short, self-contained
Technical docs | 300-500 tokens | Enough context per concept
Legal contracts | 500-1000 tokens | Clauses need full context
Code | Function/class level | Logical units
Chat logs | Per conversation turn | Natural boundaries

Chunking Code Example (LangChain)

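# Recent LangChain releases also expose this class from the
# langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # measured by length_function, so characters here
    chunk_overlap=50,      # 10% overlap between neighbouring chunks
    length_function=len,
    # Tried in order: paragraphs, lines, sentences, words, then single characters
    separators=["\n\n", "\n", ". ", " ", ""]
)
# document_text is assumed to hold the raw text of one loaded document
chunks = splitter.split_text(document_text)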

Embedding Models Comparison

Your choice of embedding model determines the ceiling on retrieval quality -- no amount of reranking or prompt tuning can fix a bad embedding. Match the model to your data type, language, and latency budget.

Model | Dimensions | Max Tokens | MTEB Score | Cost | Notes
OpenAI text-embedding-3-large | 3072 | 8191 | ~64.6 | $0.13/1M tokens | Dimension reduction supported
OpenAI text-embedding-3-small | 1536 | 8191 | ~62.3 | $0.02/1M tokens | Good price/performance
Cohere embed-v3 | 1024 | 512 | ~64.5 | $0.10/1M tokens | Multi-lingual
Voyage-3 | 1024 | 32000 | ~67.2 | $0.06/1M tokens | Long context, code-aware
BGE-large-en-v1.5 | 1024 | 512 | ~64.2 | Free (self-host) | Open source
E5-mistral-7b | 4096 | 32768 | ~66.6 | Free (self-host) | Best open source
GTE-Qwen2-7B | 3584 | 131072 | ~67.2 | Free (self-host) | Very long context
Google text-embedding-004 | 768 | 2048 | ~66.3 | $0.025/1M chars | Vertex AI

Embedding Best Practices

  • Normalize embeddings to unit length when your index scores by dot product, so the score equals cosine similarity
  • Use the same model for indexing and querying
  • Prefix queries with "query: " and docs with "passage: " for asymmetric models that expect these prefixes (e.g., the E5 family)
  • Batch embedding requests (100-1000 items per call)
  • Cache embeddings; do not re-embed unchanged content
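
The normalization, batching, and caching practices above can be sketched together. This is an illustration under stated assumptions, not a reference implementation: `EmbeddingCache`, `l2_normalize`, and `fake_embed_batch` are hypothetical names, and the stand-in embedder would be replaced by a call to your actual model.

```python
import hashlib
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

class EmbeddingCache:
    """Caches embeddings by content hash so unchanged text is not re-embedded."""

    def __init__(self, embed_batch, batch_size=100):
        self.embed_batch = embed_batch  # hypothetical: list[str] -> list[list[float]]
        self.batch_size = batch_size
        self.store = {}

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        # Only texts not seen before, de-duplicated, in their original order.
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self.store))
        for i in range(0, len(missing), self.batch_size):
            batch = missing[i:i + self.batch_size]
            for text, vec in zip(batch, self.embed_batch(batch)):
                self.store[self._key(text)] = l2_normalize(vec)
        return [self.store[self._key(t)] for t in texts]

# Stand-in embedder for demonstration only; it records the size of each batch it receives.
calls = []
def fake_embed_batch(batch):
    calls.append(len(batch))
    return [[float(len(t)), 1.0] for t in batch]

cache = EmbeddingCache(fake_embed_batch, batch_size=2)
cache.embed(["alpha", "beta", "alpha"])  # embeds "alpha" and "beta" in one batch
cache.embed(["alpha", "gamma"])          # only "gamma" is new
print(calls)  # [2, 1]
```

Keying the cache on a hash of the text means a re-index only pays for content that actually changed.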

Vector Database Comparison

The vector database you choose dictates your scaling ceiling, operational complexity, and whether you can do hybrid search natively. Pick based on your scale and team capacity, not hype.

Database | Type | Max Vectors | Filtering | Hybrid Search | Managed?
Pinecone | Cloud-native | 1B+ | Metadata | Yes | Yes
Weaviate | Self-host/Cloud | Billions | GraphQL | Yes | Both
Qdrant | Self-host/Cloud | Billions | Payload | Yes | Both
ChromaDB | Embedded | Millions | Metadata | No | No
pgvector | PostgreSQL ext | Millions | SQL | Yes (via SQL) | Both
Milvus | Distributed | Billions | Expressions | Yes | Both
FAISS | In-memory lib | Millions | No (manual) | No | No
Elasticsearch | Distributed | Billions | Full query DSL | Yes | Both

When to Use Which

Scenario | Recommended
Prototype / POC | ChromaDB, FAISS
Already using Postgres | pgvector
Production, managed | Pinecone, Weaviate Cloud
High scale, self-managed | Milvus, Qdrant
Need full-text + vector | Elasticsearch, Weaviate

Retrieval Strategies

Dense retrieval alone misses exact keyword matches; sparse retrieval alone misses semantic meaning. Understanding when to use each strategy -- and when to combine them -- is the difference between a good RAG system and a great one.

Strategy | Description | Pros | Cons
Dense retrieval | Embedding similarity | Semantic understanding | Misses keywords
Sparse retrieval (BM25) | Keyword/TF-IDF | Exact match, fast | No semantic understanding
Hybrid | Dense + Sparse fusion | Best of both | More complex
HyDE | LLM generates hypothetical doc, embed that | Better query representation | Extra LLM call
Multi-query | LLM generates N query variants | Better recall | N x retrieval cost
Parent-child | Retrieve child, return parent | Precise match, full context | More indexing work
Contextual retrieval | Prepend document context to chunks | Better relevance | Higher indexing cost
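
The parent-child row is the least obvious of these, so here is a toy sketch of the idea: match against small child chunks, then hand the full parent text to the LLM. Word overlap stands in for embedding similarity, and every name and sample document below is made up for illustration.

```python
def tokenize(text):
    """Lowercase bag of words with basic punctuation stripped."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())

def build_child_index(parents, child_size=8):
    """Split each parent into small word windows that remember their parent id."""
    index = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            index.append((parent_id, " ".join(words[i:i + child_size])))
    return index

def retrieve_parents(query, parents, index, top_k=2):
    """Rank children against the query, then return each matching parent once."""
    q = tokenize(query)
    # Word overlap is a stand-in for vector similarity in this sketch.
    ranked = sorted(index, key=lambda item: len(q & tokenize(item[1])), reverse=True)
    seen, results = set(), []
    for parent_id, _child in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == top_k:
            break
    return results

# Made-up sample documents.
parents = {
    "refunds": "Refunds are issued within 14 days. Contact billing with your order number to start a refund request.",
    "shipping": "Standard shipping takes five business days. Express shipping is available for an extra fee.",
}
index = build_child_index(parents)
print(retrieve_parents("how do I start a refund request", parents, index, top_k=1))
```

The small child window gives a precise match, while the returned parent keeps the surrounding context that the child alone would lose.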

Hybrid Search Fusion

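# Reciprocal Rank Fusion (RRF): score(doc) = sum over rankers of 1 / (k + rank)
RRF_K = 60  # smoothing constant; 60 is the commonly used default

def rrf_fuse(result_lists, k=RRF_K):
    """Fuse several ranked result lists into one list of (doc_id, score)."""
    scores = {}
    for results in result_lists:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1)
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example: combine dense and sparse results
# (vector_search and bm25_search stand in for your own retrieval functions)
dense_results = vector_search(query, top_k=20)
sparse_results = bm25_search(query, top_k=20)

final = rrf_fuse([dense_results, sparse_results])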

Retrieval Metrics

You cannot improve what you do not measure. These metrics tell you whether your retrieval is actually finding the right documents and whether the LLM is using them faithfully.

Metric | Formula | What It Measures
Recall@k | relevant_in_top_k / total_relevant | Coverage of relevant docs
Precision@k | relevant_in_top_k / k | Accuracy of top results
MRR | 1/rank_of_first_relevant | Position of first hit
NDCG@k | DCG@k / IDCG@k | Ranking quality
Hit Rate | queries_with_hit / total_queries | Basic retrieval success
Faithfulness | claims_supported / total_claims | LLM output grounded in context
Answer Relevancy | Semantic sim(answer, question) | Output addresses the question
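
The retrieval-side formulas above translate almost directly into code. The sketch below assumes binary relevance labels and unique document ids per result list; the function names and the sample ids are illustrative only.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k that is relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc (1-based), or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """DCG@k / IDCG@k with binary gains (assumption: a doc is relevant or not)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Made-up example: one query, five results, three relevant docs overall.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, 5))  # 2 of 5 results relevant
print(reciprocal_rank(retrieved, relevant))    # first hit at rank 2
print(ndcg_at_k(retrieved, relevant, 5))
```

MRR and Hit Rate are per-query values averaged over an evaluation set: MRR is the mean of `reciprocal_rank`, and Hit Rate is the share of queries where it is non-zero.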

Reranking

Initial vector search returns "good enough" candidates, but reranking transforms them into precisely ordered results. Adding a reranker is often the single highest-ROI improvement you can make to an existing RAG pipeline.

Reranker | Type | Latency | Quality
Cohere Rerank v3 | API | ~200ms | High
Jina Reranker v2 | API/Self-host | ~150ms | High
BGE Reranker v2-m3 | Self-host | ~100ms | Good
Cross-encoder (ms-marco) | Self-host | ~50ms | Good
LLM-based (GPT/Claude) | API | ~1-3s | Highest
ColBERT | Self-host | ~30ms | Good

Reranking Pattern

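# Retrieve more, rerank to fewer
# (vector_search and reranker stand in for your own retrieval and reranking clients)
candidates = vector_search(query, top_k=50)  # over-fetch for recall
reranked = reranker.rank(query, candidates, top_k=5)  # narrow down for precision
context = "\n\n".join(doc.text for doc in reranked)  # only the top 5 reach the prompt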

Prompt Construction for RAG

The prompt is where retrieval meets generation -- a poorly structured RAG prompt produces hallucinations even when the right documents are retrieved. Always enforce citation and include a fallback for missing information.

Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."
Always cite the source document for each claim.

Context:
{chunk_1}
Source: {source_1}

{chunk_2}
Source: {source_2}

Question: {user_query}
Answer:
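
One way to fill the template above from retrieved chunks is sketched below. The function name `build_rag_prompt`, the `(text, source)` pair format, and the sample chunks are assumptions made for this example.

```python
def build_rag_prompt(user_query, chunks):
    """Render the RAG prompt; `chunks` is a ranked list of (text, source) pairs."""
    header = (
        "Use the following context to answer the question.\n"
        'If the answer is not in the context, say "I don\'t have enough information."\n'
        "Always cite the source document for each claim.\n"
    )
    # Keep each chunk next to its source so the model can cite it.
    context = "\n\n".join(f"{text}\nSource: {source}" for text, source in chunks)
    return f"{header}\nContext:\n{context}\n\nQuestion: {user_query}\nAnswer:"

# Made-up sample chunks.
sample_chunks = [
    ("Employees accrue 1.5 vacation days per month.", "handbook.pdf"),
    ("Unused vacation days expire on March 31.", "policy-2024.md"),
]
print(build_rag_prompt("How many vacation days do I earn each month?", sample_chunks))
```

Ending the prompt on "Answer:" leaves the model to continue directly with the answer instead of restating the question.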

Common Pitfalls

Every RAG failure mode here has been encountered in real production systems. When your RAG answers are wrong, irrelevant, or missing, start debugging from this list.

Pitfall | Problem | Fix
Chunks too small | Lost context | Increase size or use parent-child
Chunks too large | Diluted relevance | Decrease size, improve chunking
No overlap | Split concepts | Add 10-20% overlap
Wrong embedding model | Poor retrieval | Benchmark on your data
No metadata filtering | Irrelevant results | Add source, date, category metadata
No reranking | Mediocre top-k | Add cross-encoder or Cohere reranker
Ignoring freshness | Stale answers | Track document timestamps, re-index
Single retrieval strategy | Missed results | Use hybrid (dense + sparse)
No evaluation | Unknown quality | Measure recall, precision, faithfulness
Embedding query = doc embedding | Sub-optimal for asymmetric search | Use query prefixes for asymmetric models

Quick Sizing Reference

Use this table to estimate infrastructure requirements before you start building. Getting the storage and database choice wrong early means a painful migration later.

Documents | Embedding Dim | Approx Vector Storage | Recommended DB
< 10K | 1536 | ~60 MB | ChromaDB, FAISS
10K-1M | 1536 | ~6 GB | pgvector, Qdrant
1M-100M | 1024 | ~400 GB | Pinecone, Milvus
100M+ | 768-1024 | ~4+ TB | Milvus cluster, Elasticsearch
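
The first three rows are consistent with a simple estimate: number of vectors x dimensions x 4 bytes, taken at the upper end of each range. The sketch below assumes float32 values and one vector per document, and ignores index overhead, metadata, and replication, all of which add to the real footprint. The helper names are illustrative.

```python
def vector_storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage: one float32 (4 bytes) per dimension per vector."""
    return num_vectors * dim * bytes_per_value

def human(n_bytes):
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000

print(human(vector_storage_bytes(10_000, 1536)))       # 61.4 MB
print(human(vector_storage_bytes(1_000_000, 1536)))    # 6.1 GB
print(human(vector_storage_bytes(100_000_000, 1024)))  # 409.6 GB
```

If each document is split into several chunks, multiply by the average chunks per document before picking a row.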