RAG Systems | LegacyForward.ai

RAG Pipeline Overview

Retrieval-augmented generation (RAG) is the most practical way to give an LLM access to private or current data without fine-tuning. Most production RAG systems follow the same three-phase pattern: ingest documents, retrieve relevant chunks, generate an answer. Get this pipeline right and the rest is optimization.

Document Ingestion          Retrieval                  Generation
┌──────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Load Docs    │     │ Embed Query      │     │ Build Prompt     │
│ Chunk Text   │────>│ Vector Search    │────>│ Context + Query  │
│ Embed Chunks │     │ Rerank Results   │     │ LLM Generation   │
│ Store in VDB │     │ Filter/Dedupe    │     │ Citation/Source  │
└──────────────┘     └──────────────────┘     └──────────────────┘
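
As a rough illustration of the three phases, here is a minimal in-memory sketch. The bag-of-words `embed` function, the sample documents, and the list-based `index` are all assumptions standing in for a real embedding model and vector database; the generation phase stops at building the prompt that would be sent to an LLM.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1: ingestion (load, chunk, embed, store). Sample data is made up.
docs = {
    "handbook.txt": "Employees accrue vacation days monthly. Unused vacation days expire in March.",
    "security.txt": "Rotate API keys every ninety days. Store API keys in the vault.",
}
index = []  # in-memory stand-in for a vector database
for source, text in docs.items():
    for chunk in text.split(". "):  # naive sentence-level chunking
        index.append({"source": source, "text": chunk, "vector": embed(chunk)})

# Phase 2: retrieval (embed the query, rank stored chunks by similarity)
def retrieve(query, top_k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return ranked[:top_k]

# Phase 3: generation (build the prompt; the LLM call itself is omitted)
def build_prompt(query, hits):
    context = "\n\n".join(f"{h['text']}\nSource: {h['source']}" for h in hits)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

hits = retrieve("when do vacation days expire")
prompt = build_prompt("when do vacation days expire", hits)
```

Swapping `embed` for a real model and `index` for a vector store changes the quality of each phase, not the shape of the pipeline.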

Chunking Strategies

Chunking is where most RAG pipelines silently fail -- chunks that are too small lose context, chunks that are too large dilute relevance. The right strategy depends entirely on your document type.

Strategy	Chunk Size	Overlap	Best For
Fixed-size	256-512 tokens	10-20%	General purpose, simple
Sentence-based	3-5 sentences	1 sentence	Narrative text
Paragraph-based	1-3 paragraphs	1 paragraph	Structured docs
Recursive character	500-1000 chars	100-200 chars	Mixed content
Semantic	Varies	Concept boundary	Research papers
Document-specific	By section/page	Headers as context	Technical docs
Agentic (LLM-based)	Varies	Determined by model	Complex/messy content
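
To make the sentence-based row concrete, the following sketch groups sentences into overlapping windows. The function name `sentence_chunks` and the regex-based sentence splitter are assumptions chosen for illustration; a production pipeline would likely use a proper sentence tokenizer.

```python
import re

def sentence_chunks(text, sentences_per_chunk=4, overlap=1):
    """Group sentences into chunks that share `overlap` sentences with the previous chunk."""
    if overlap >= sentences_per_chunk:
        raise ValueError("overlap must be smaller than sentences_per_chunk")
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks

example = "A one. B two. C three. D four. E five. F six."
example_chunks = sentence_chunks(example, sentences_per_chunk=3, overlap=1)
```

With three sentences per chunk and one sentence of overlap, each chunk starts with the last sentence of the previous one, so a concept that straddles a boundary appears whole in at least one chunk.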

Chunk Size Guidelines

Content Type	Recommended Size	Reasoning
FAQ / Q&A	100-200 tokens	Short, self-contained
Technical docs	300-500 tokens	Enough context per concept
Legal contracts	500-1000 tokens	Clauses need full context
Code	Function/class level	Logical units
Chat logs	Per conversation turn	Natural boundaries
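
For the "Code" row, one possible way to chunk at function/class level for Python source is the standard library `ast` module. This is a sketch under the assumption that the input is valid Python; the `chunk_python_source` name and the output dictionary shape are illustrative, and other languages would need their own parser.

```python
import ast

def chunk_python_source(source):
    """Return one chunk per top-level function or class in a Python source string."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of the node (decorator lines are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks

sample = '''
import os

def load(path):
    return open(path).read()

class Index:
    def add(self, item):
        pass
'''
code_chunks = chunk_python_source(sample)
```

Methods stay inside their class chunk here, which keeps each chunk a self-contained logical unit; very large classes may need a second pass that splits per method.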

Chunking Code Example (LangChain)

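# Recent LangChain releases also expose this class from the separate
# langchain_text_splitters package; use whichever import your version provides.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # measured by length_function, so characters here, not tokens
    chunk_overlap=50,     # 10% overlap between adjacent chunks
    length_function=len,
    # Separators are tried in order: paragraph, line, sentence, word, character
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)  # document_text: the raw string to split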

Embedding Models Comparison

Your choice of embedding model determines the ceiling on retrieval quality -- no amount of reranking or prompt tuning can fix a bad embedding. Match the model to your data type, language, and latency budget.

Model	Dimensions	Max Tokens	MTEB Score	Cost	Notes
OpenAI text-embedding-3-large	3072	8191	~64.6	$0.13/1M tokens	Dimension reduction supported
OpenAI text-embedding-3-small	1536	8191	~62.3	$0.02/1M tokens	Good price/performance
Cohere embed-v3	1024	512	~64.5	$0.10/1M tokens	Multi-lingual
Voyage-3	1024	32000	~67.2	$0.06/1M tokens	Long context, code-aware
BGE-large-en-v1.5	1024	512	~64.2	Free (self-host)	Open source
E5-mistral-7b	4096	32768	~66.6	Free (self-host)	Best open source
GTE-Qwen2-7B	3584	131072	~67.2	Free (self-host)	Very long context
Google text-embedding-004	768	2048	~66.3	$0.025/1M chars	Vertex AI

Embedding Best Practices

Normalize embeddings to unit length so that dot product and cosine similarity give the same ranking
Use the same model for indexing and querying
Prefix queries with "query: " and docs with "passage: " for asymmetric models that expect them (the E5 family, for example); check the model card for the exact prefixes
Batch embedding requests (100-1000 items per call)
Cache embeddings; do not re-embed unchanged content
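
Several of these practices (normalizing, batching, caching) can be combined in a thin wrapper around whatever embedding call you use. The sketch below is illustrative only: `fake_embed_batch` is a hypothetical stand-in for a real embedding API, and `CachedEmbedder` is a name invented for this example.

```python
import hashlib
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def fake_embed_batch(texts):
    """Hypothetical stand-in for a real embedding API call."""
    return [[float(len(t)), float(t.count(" ") + 1), 1.0] for t in texts]

class CachedEmbedder:
    def __init__(self, embed_batch, batch_size=100):
        self.embed_batch = embed_batch
        self.batch_size = batch_size
        self.cache = {}      # content hash -> normalized vector
        self.api_calls = 0   # counts batched calls, for illustration

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        # Only embed texts not seen before, de-duplicated, in stable order.
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self.cache))
        for i in range(0, len(missing), self.batch_size):
            batch = missing[i:i + self.batch_size]
            self.api_calls += 1
            for text, vec in zip(batch, self.embed_batch(batch)):
                self.cache[self._key(text)] = normalize(vec)
        return [self.cache[self._key(t)] for t in texts]

embedder = CachedEmbedder(fake_embed_batch, batch_size=2)
vectors = embedder.embed(["a b", "c", "a b"])   # one batched call for two unique texts
vectors_again = embedder.embed(["a b"])         # served from the cache
```

Keying the cache on a content hash means unchanged chunks are never re-embedded during re-indexing; in practice the cache would live in a database or on disk and would also include the model name in the key.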

Vector Database Comparison

The vector database you choose dictates your scaling ceiling, operational complexity, and whether you can do hybrid search natively. Pick based on your scale and team capacity, not hype.

Database	Type	Max Vectors	Filtering	Hybrid Search	Managed?
Pinecone	Cloud-native	1B+	Metadata	Yes	Yes
Weaviate	Self-host/Cloud	Billions	GraphQL	Yes	Both
Qdrant	Self-host/Cloud	Billions	Payload	Yes	Both
ChromaDB	Embedded	Millions	Metadata	No	No
pgvector	PostgreSQL ext	Millions	SQL	Yes (via SQL)	Both
Milvus	Distributed	Billions	Expressions	Yes	Both
FAISS	In-memory lib	Millions	No (manual)	No	No
Elasticsearch	Distributed	Billions	Full query DSL	Yes	Both

When to Use Which

Scenario	Recommended
Prototype / POC	ChromaDB, FAISS
Already using Postgres	pgvector
Production, managed	Pinecone, Weaviate Cloud
High scale, self-managed	Milvus, Qdrant
Need full-text + vector	Elasticsearch, Weaviate

Retrieval Strategies

Dense retrieval alone misses exact keyword matches; sparse retrieval alone misses semantic meaning. Understanding when to use each strategy -- and when to combine them -- is the difference between a good RAG system and a great one.

Strategy	Description	Pros	Cons
Dense retrieval	Embedding similarity	Semantic understanding	Misses keywords
Sparse retrieval (BM25)	Keyword/TF-IDF	Exact match, fast	No semantic understanding
Hybrid	Dense + Sparse fusion	Best of both	More complex
HyDE	LLM generates hypothetical doc, embed that	Better query representation	Extra LLM call
Multi-query	LLM generates N query variants	Better recall	N x retrieval cost
Parent-child	Retrieve child, return parent	Precise match, full context	More indexing work
Contextual retrieval	Prepend document context to chunks	Better relevance	Higher indexing cost
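
The parent-child row is easier to see in code: match against small child chunks, then return the larger parent they came from. This sketch uses made-up sample documents and a word-overlap score as a stand-in for embedding similarity; `retrieve_parents` is an illustrative name.

```python
import re

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

# Parents are the full sections; the content is sample data for illustration.
parents = {
    "refunds": "Refund policy. Refunds are issued within 14 days of purchase. "
               "Items must be unused. Shipping fees are not refunded.",
    "shipping": "Shipping times. Standard shipping takes five business days. "
                "Express shipping takes two business days.",
}

# Children are single sentences, each remembering its parent id.
children = [
    (parent_id, sentence)
    for parent_id, text in parents.items()
    for sentence in text.split(". ")
]

def retrieve_parents(query, top_k=1):
    """Score the small child chunks, then return their de-duplicated parent sections."""
    q = tokenize(query)
    ranked = sorted(children, key=lambda c: len(q & tokenize(c[1])), reverse=True)
    parent_ids = []
    for parent_id, _ in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == top_k:
            break
    return [parents[pid] for pid in parent_ids]

result = retrieve_parents("are shipping fees refunded")
```

The query matches the single sentence about shipping fees most strongly, but the LLM receives the whole refund section, which is the "precise match, full context" trade described in the table.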

Hybrid Search Fusion

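# Reciprocal Rank Fusion (RRF): score(doc) = sum over result lists of 1 / (k + rank)
K = 60  # smoothing constant; 60 is the conventional default

def rrf_score(ranks, k=K):
    """Fused score for one document given its 1-based rank in each list."""
    return sum(1.0 / (k + r) for r in ranks)

# Example: combine dense and sparse results
# vector_search and bm25_search are placeholders for your own retrievers
dense_results = vector_search(query, top_k=20)
sparse_results = bm25_search(query, top_k=20)

ranks_by_doc = {}
for results in (dense_results, sparse_results):
    for rank, doc in enumerate(results, start=1):  # RRF ranks start at 1, not 0
        ranks_by_doc.setdefault(doc.id, []).append(rank)

scores = {doc_id: rrf_score(ranks) for doc_id, ranks in ranks_by_doc.items()}
final = sorted(scores.items(), key=lambda x: x[1], reverse=True)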

Retrieval Metrics

You cannot improve what you do not measure. These metrics tell you whether your retrieval is actually finding the right documents and whether the LLM is using them faithfully.

Metric	Formula	What It Measures
Recall@k	relevant_in_top_k / total_relevant	Coverage of relevant docs
Precision@k	relevant_in_top_k / k	Accuracy of top results
MRR	mean(1 / rank_of_first_relevant) over queries	Position of first hit
NDCG@k	DCG@k / IDCG@k	Ranking quality
Hit Rate	queries_with_hit / total_queries	Basic retrieval success
Faithfulness	claims_supported / total_claims	LLM output grounded in context
Answer Relevancy	Semantic sim(answer, question)	Output addresses the question
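
The retrieval-side formulas above translate directly into a few small functions. This is a sketch assuming binary relevance (a document is either relevant or not) and ranked lists without duplicate IDs; the function names and the sample IDs are illustrative.

```python
import math

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary gains: DCG@k divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def hit_rate(runs, k):
    """Fraction of queries with at least one relevant document in the top k."""
    return sum(1 for r, rel in runs if set(r[:k]) & rel) / len(runs)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
runs = [(["d3", "d1"], {"d1"}), (["d5"], {"d9"})]
```

Faithfulness and answer relevancy are not covered here because they score the generated text and typically need an LLM or a trained judge rather than ID matching.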

Reranking

Initial vector search returns "good enough" candidates; a reranker then rescores each query-document pair directly and moves the most relevant ones to the top. Adding a reranker is often one of the highest-ROI improvements you can make to an existing RAG pipeline.

Reranker	Type	Latency	Quality
Cohere Rerank v3	API	~200ms	High
Jina Reranker v2	API/Self-host	~150ms	High
BGE Reranker v2-m3	Self-host	~100ms	Good
Cross-encoder (ms-marco)	Self-host	~50ms	Good
LLM-based (GPT/Claude)	API	~1-3s	Highest
ColBERT	Self-host	~30ms	Good

Reranking Pattern

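# Retrieve more, rerank to fewer
# vector_search and reranker are placeholders for your retriever and reranking client
candidates = vector_search(query, top_k=50)  # over-fetch for recall
reranked = reranker.rank(query, candidates, top_k=5)  # keep the 5 best after rescoring
context = "\n\n".join(doc.text for doc in reranked)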
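
A self-contained version of the same over-fetch-then-rerank pattern might look like the sketch below. The `rerank` helper and the word-overlap `overlap_score` are assumptions made for illustration; in a real system `score_fn` would wrap a cross-encoder or a reranking API that scores each query-document pair.

```python
import re

def overlap_score(query, doc):
    """Toy relevance score: fraction of query words that appear in the document."""
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0

def rerank(query, candidates, score_fn, top_k=5):
    """Score every (query, candidate) pair and keep the top_k highest scoring."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for equal scores
    return [doc for _, doc in scored[:top_k]]

# Sample candidates, in the order a first-stage retriever might return them.
candidates = [
    "Billing questions go to the finance team.",
    "Password rules: at least 12 characters.",
    "Reset your password from the login page.",
]
top = rerank("how do I reset my password", candidates, overlap_score, top_k=2)
context = "\n\n".join(top)
```

Because the scorer is passed in as a function, the first-stage retriever and the reranker can be swapped or benchmarked independently.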

Prompt Construction for RAG

The prompt is where retrieval meets generation -- a poorly structured RAG prompt produces hallucinations even when the right documents are retrieved. Always enforce citation and include a fallback for missing information.

Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."
Always cite the source document for each claim.

Context:
{chunk_1}
Source: {source_1}

{chunk_2}
Source: {source_2}

Question: {user_query}
Answer:
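
One way to fill this template from retrieved chunks is sketched below. The `build_rag_prompt` name, the chunk dictionary shape (`text`, `source`), and the character-based context budget are assumptions for illustration; a token-based budget would be more accurate for a specific model.

```python
PROMPT_TEMPLATE = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."
Always cite the source document for each claim.

Context:
{context}

Question: {user_query}
Answer:"""

def build_rag_prompt(user_query, chunks, max_context_chars=4000):
    """Fill the template with as many chunks as fit in the character budget."""
    blocks, used = [], 0
    for chunk in chunks:
        block = f"{chunk['text']}\nSource: {chunk['source']}"
        # Always keep the first chunk; stop once the budget would be exceeded.
        if blocks and used + len(block) > max_context_chars:
            break
        blocks.append(block)
        used += len(block)
    return PROMPT_TEMPLATE.format(context="\n\n".join(blocks), user_query=user_query)

prompt = build_rag_prompt(
    "How long do refunds take?",
    [
        {"text": "Refunds are issued within 14 days.", "source": "policy.pdf"},
        {"text": "Shipping fees are not refunded.", "source": "faq.md"},
    ],
)
```

Chunks should be passed in ranked order so that the budget drops the least relevant ones first.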

Common Pitfalls

Every RAG failure mode here has been encountered in real production systems. When your RAG answers are wrong, irrelevant, or missing, start debugging from this list.

Pitfall	Problem	Fix
Chunks too small	Lost context	Increase size or use parent-child
Chunks too large	Diluted relevance	Decrease size, improve chunking
No overlap	Split concepts	Add 10-20% overlap
Wrong embedding model	Poor retrieval	Benchmark on your data
No metadata filtering	Irrelevant results	Add source, date, category metadata
No reranking	Mediocre top-k	Add cross-encoder or Cohere reranker
Ignoring freshness	Stale answers	Track document timestamps, re-index
Single retrieval strategy	Missed results	Use hybrid (dense + sparse)
No evaluation	Unknown quality	Measure recall, precision, faithfulness
Embedding queries and docs identically	Sub-optimal for asymmetric search	Use query prefixes for asymmetric models
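
The metadata and freshness fixes above usually come down to filtering candidates before or alongside the similarity search. The sketch below shows the idea with plain Python; the sample chunks, the field names (`category`, `updated`), and the `filter_chunks` helper are assumptions, and a real vector database would apply the equivalent filter inside the query.

```python
from datetime import date

# Sample chunks with metadata attached at ingestion time (illustrative data).
chunks = [
    {"text": "2022 pricing: Pro plan costs $40.", "category": "pricing", "updated": date(2022, 1, 10)},
    {"text": "2024 pricing: Pro plan costs $55.", "category": "pricing", "updated": date(2024, 3, 1)},
    {"text": "Pro plan includes priority support.", "category": "support", "updated": date(2024, 2, 5)},
]

def filter_chunks(chunks, category=None, updated_after=None):
    """Keep chunks in the given category that were updated after the cutoff date."""
    kept = []
    for chunk in chunks:
        if category is not None and chunk["category"] != category:
            continue
        if updated_after is not None and chunk["updated"] <= updated_after:
            continue
        kept.append(chunk)
    return kept

fresh_pricing = filter_chunks(chunks, category="pricing", updated_after=date(2023, 1, 1))
```

Without the date filter, both pricing chunks would look equally relevant to a similarity search, and the stale one could end up in the prompt.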

Quick Sizing Reference

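Use this table to estimate infrastructure requirements before you start building. Storage figures are for raw float32 vectors (count x dimensions x 4 bytes) at the upper end of each range, with the last row assuming roughly 1B vectors; index structures and metadata come on top. Getting the storage and database choice wrong early means a painful migration later.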

Documents	Embedding Dim	Approx Vector Storage	Recommended DB
< 10K	1536	~60 MB	ChromaDB, FAISS
10K-1M	1536	~6 GB	pgvector, Qdrant
1M-100M	1024	~400 GB	Pinecone, Milvus
100M+	768-1024	~4+ TB	Milvus cluster, Elasticsearch
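
The storage column can be reproduced with simple arithmetic: vectors x dimensions x bytes per value. The sketch below assumes float32 (4 bytes per value) and counts raw vectors only; the function names are illustrative, and real deployments should add headroom for index structures, metadata, and replicas.

```python
def vector_storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value

def format_bytes(n):
    """Human-readable size using decimal units (1 GB = 10**9 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1000

small = format_bytes(vector_storage_bytes(10_000, 1536))          # 10K docs, 1536 dims
medium = format_bytes(vector_storage_bytes(1_000_000, 1536))      # 1M docs, 1536 dims
large = format_bytes(vector_storage_bytes(100_000_000, 1024))     # 100M docs, 1024 dims
huge = format_bytes(vector_storage_bytes(1_000_000_000, 1024))    # 1B docs, 1024 dims
```

Passing `bytes_per_value=2` or `1` gives a quick estimate for float16 or int8 quantized vectors, which is often the cheapest way to move down a row in the table.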