Enterprise RAG | LegacyForward.ai

Enterprise RAG vs Basic RAG

A demo RAG pipeline and a production enterprise RAG system differ in scope, not just in scale. The comparison below shows where basic RAG falls short and what a real-world deployment needs to add.

Aspect	Basic RAG	Enterprise RAG
Data sources	Single source	Multi-source, multi-format
Access control	None	Per-document ACLs
Search	Dense vector only	Hybrid (dense + sparse + metadata)
Ranking	Top-k by similarity	Reranking + business rules
Freshness	Static index	Incremental updates, TTL
Citation	None or approximate	Exact source + page/section
Scale	Thousands of docs	Millions of docs
Monitoring	None	Retrieval quality, latency, cost
Data lineage	None	Full provenance tracking

Ingestion Pipeline

The ingestion pipeline is the foundation of your entire RAG system: garbage in means garbage out at query time. Getting document processing, metadata extraction, and ACL mapping right during ingestion prevents many downstream problems that are expensive to fix later.

Data Sources           Processing              Storage
┌──────────┐     ┌──────────────────┐     ┌─────────────┐
│ SharePoint│     │ Format Detection │     │ Vector DB   │
│ Confluence│────>│ Text Extraction  │────>│ (embeddings)│
│ Google Dr │     │ Chunking         │     ├─────────────┤
│ S3 / GCS  │     │ Metadata Extract │     │ Search Index│
│ Databases │     │ ACL Mapping      │     │ (BM25/FTS)  │
│ APIs      │     │ Embedding        │     ├─────────────┤
└──────────┘     │ Quality Filter   │     │ Metadata DB │
                  └──────────────────┘     │ (ACLs, etc) │
                                           └─────────────┘

Document Processing by Format

Format	Extraction Tool	Notes
PDF	PyMuPDF, Unstructured, Amazon Textract	OCR for scanned docs
DOCX/PPTX	python-docx, python-pptx, Unstructured	Preserve structure
HTML	BeautifulSoup, Trafilatura	Strip boilerplate
Markdown	Direct parse	Preserve headers as metadata
CSV/Excel	pandas	Row-level or table-level chunks
Images	OCR (Tesseract, Textract) + VLM	Vision model for diagrams
Audio/Video	Whisper transcription	Timestamps as metadata
Confluence/Wiki	API extraction	Page hierarchy as context
Code	AST parsing	Function/class level
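As one illustration of the Markdown row above ("Preserve headers as metadata"), the sketch below splits a Markdown document at its headings and stores each chunk's heading path alongside its text. `SectionChunk` and `chunk_markdown_by_section` are hypothetical names, not part of any library, and the heading regex is a simplification: it does not skip `#` lines that appear inside fenced code blocks.

```python
import re
from dataclasses import dataclass

# ATX-style headings only: one to six '#' characters, then the title
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class SectionChunk:
    heading_path: list[str]  # e.g. ["Guide", "Install"]
    text: str


def chunk_markdown_by_section(markdown: str) -> list[SectionChunk]:
    """Split Markdown at headings, keeping the heading hierarchy per chunk."""
    chunks: list[SectionChunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    buffer: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current heading path, if any
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(SectionChunk([title for _, title in path], body))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            buffer.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new heading closes every open heading at the same or deeper level
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, match.group(2)))
    flush()
    return chunks
```

The heading path can be stored as chunk metadata or prepended to the chunk text before embedding, so a chunk such as "Run the installer." still carries the context "Guide > Install".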

Metadata to Extract and Store

Field	Purpose	Example
`source_id`	Unique document identifier	`confluence://page/12345`
`title`	Display and search	"Q3 Revenue Report"
`source_type`	Format/origin	`pdf`, `confluence`, `slack`
`created_at`	Freshness sorting	`2025-11-15T10:00:00Z`
`updated_at`	Freshness, re-indexing	`2026-01-20T14:30:00Z`
`author`	Attribution	"jane.doe@company.com"
`acl_groups`	Access control	`["engineering", "all-staff"]`
`acl_users`	Access control	`["user123"]`
`department`	Filtering	"Engineering"
`chunk_index`	Ordering within doc	`3` (of 12)
`parent_id`	Parent-child linking	`doc_abc_chunk_0`
`confidence`	Extraction quality	`0.95`
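The fields in the table above could be carried through the pipeline as a single typed record. The following is a minimal sketch, assuming Python dataclasses and timezone-aware datetimes; the class name `ChunkMetadata`, the default values, and the validation rules are illustrative choices, not something the table prescribes.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChunkMetadata:
    source_id: str
    title: str
    source_type: str
    created_at: datetime
    updated_at: datetime
    author: str
    acl_groups: list[str] = field(default_factory=list)
    acl_users: list[str] = field(default_factory=list)
    department: Optional[str] = None
    chunk_index: int = 0
    parent_id: Optional[str] = None
    confidence: float = 1.0

    def __post_init__(self) -> None:
        # Reject naive timestamps so freshness math never mixes time zones
        for name in ("created_at", "updated_at"):
            if getattr(self, name).tzinfo is None:
                raise ValueError(f"{name} must be timezone-aware")
        if self.updated_at < self.created_at:
            raise ValueError("updated_at precedes created_at")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_record(self) -> dict:
        """Return a plain dict with ISO 8601 timestamps, ready to store."""
        record = asdict(self)
        record["created_at"] = self.created_at.isoformat()
        record["updated_at"] = self.updated_at.isoformat()
        return record


meta = ChunkMetadata(
    source_id="confluence://page/12345",
    title="Q3 Revenue Report",
    source_type="confluence",
    created_at=datetime(2025, 11, 15, 10, 0, tzinfo=timezone.utc),
    updated_at=datetime(2026, 1, 20, 14, 30, tzinfo=timezone.utc),
    author="jane.doe@company.com",
    acl_groups=["engineering", "all-staff"],
    chunk_index=3,
)
```

Validating at construction time means a malformed record fails during ingestion rather than surfacing later as a silent filtering or sorting error.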

Access Control (ACL) Filtering

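In enterprise environments, returning a document the user should not see is a security incident, not a UX bug. Enforce ACL filtering inside the retrieval query rather than as a post-processing step: filtering after retrieval can leave fewer than top-k results, and it lets restricted text reach components that sit between the index and the filter.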

Pattern: Pre-filter at Query Time

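def search_with_acl(query: str, user: User, top_k: int = 10):
    """Run a vector search restricted to documents the user may read."""
    # get_user_groups, embed, and vector_db are application-specific placeholders.
    # Every user is treated as a member of the "public" group.
    user_groups = [*get_user_groups(user.id), "public"]

    # Match chunks shared with one of the user's groups or with the user directly.
    # The $or / $in operator syntax varies by vector database.
    acl_filter = {
        "$or": [
            {"acl_groups": {"$in": user_groups}},
            {"acl_users": {"$in": [user.id]}},
        ]
    }

    # Apply the filter inside the search so restricted chunks never leave the index
    results = vector_db.query(
        embedding=embed(query),
        filter=acl_filter,
        top_k=top_k
    )
    return results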
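To make the difference between pre-filtering and post-filtering concrete, here is a toy in-memory comparison. The `Doc` class, the scores, and the document names are invented for illustration; a real system would push the filter into the vector database query as in the pattern above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    score: float           # stand-in for query similarity
    acl_groups: frozenset  # groups allowed to read this doc


def pre_filter_search(docs, user_groups, top_k):
    """Drop unreadable docs first, then rank what is left."""
    allowed = [d for d in docs if d.acl_groups & user_groups]
    return sorted(allowed, key=lambda d: d.score, reverse=True)[:top_k]


def post_filter_search(docs, user_groups, top_k):
    """Rank everything, then drop unreadable docs from the top-k."""
    top = sorted(docs, key=lambda d: d.score, reverse=True)[:top_k]
    return [d for d in top if d.acl_groups & user_groups]


corpus = [
    Doc("hr-salaries", 0.92, frozenset({"hr"})),
    Doc("hr-reviews", 0.88, frozenset({"hr"})),
    Doc("eng-runbook", 0.75, frozenset({"engineering"})),
    Doc("eng-oncall", 0.61, frozenset({"engineering"})),
]
engineer = frozenset({"engineering"})

pre = pre_filter_search(corpus, engineer, top_k=2)
post = post_filter_search(corpus, engineer, top_k=2)
print([d.doc_id for d in pre])   # ['eng-runbook', 'eng-oncall']
print([d.doc_id for d in post])  # []
```

With post-filtering, the two highest-scoring documents are both restricted, so the engineer gets no results even though two readable documents exist.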

ACL Sync Strategy

Strategy	Description	Freshness	Complexity
Ingest-time ACL	Copy ACLs at indexing	Stale until re-index	Low
Query-time ACL check	Verify against source at query	Real-time	High latency
ACL cache + sync	Cache ACLs, sync periodically	Minutes-stale	Medium
Event-driven ACL	Update on permission change events	Near real-time	Medium-high

Recommendation: for most enterprise use cases, use an ACL cache with a 5-15 minute sync interval, and accept that a revoked permission can remain effective until the next sync.
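A cached ACL lookup with a time-to-live might look like the sketch below. `AclCache`, `fetch_groups`, and the 600-second default (10 minutes, inside the 5-15 minute range) are assumptions made for illustration; the injectable `clock` exists only so the expiry logic can be tested without sleeping.

```python
import time
from typing import Callable


class AclCache:
    """Per-user group cache that re-fetches from the source after a TTL."""

    def __init__(
        self,
        fetch_groups: Callable[[str], set[str]],
        ttl_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_groups
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, frozenset[str]]] = {}

    def get_groups(self, user_id: str) -> frozenset[str]:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh, skip the source system
        groups = frozenset(self._fetch(user_id))
        self._entries[user_id] = (now, groups)
        return groups

    def invalidate(self, user_id: str) -> None:
        """Drop a user's entry, e.g. when a permission-change event arrives."""
        self._entries.pop(user_id, None)
```

Calling `invalidate` from a permission-change handler combines this with the event-driven strategy from the table: the TTL bounds staleness in the worst case, while events shorten it in the common case.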

Hybrid Search

Enterprise queries range from exact error codes to broad conceptual questions -- no single search method handles both well. Hybrid search combines dense and sparse retrieval with metadata filtering to cover the full spectrum.

Architecture

Query
  │
  ├──> Dense Search (embedding similarity)  ──> Scores A
  │
  ├──> Sparse Search (BM25 / keyword)       ──> Scores B
  │
  └──> Metadata Filter (date, source, ACL)   ──> Allowed docs (restricts A and B, adds no score)

  Fusion (RRF or weighted sum) ──> Reranker ──> Top K results

Reciprocal Rank Fusion

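def rrf_fuse(result_lists: list[list], k: int = 60) -> list[tuple]:
    """Fuse multiple ranked lists using Reciprocal Rank Fusion.

    Returns (doc_id, fused_score) pairs, best first. A larger k flattens the
    gap between top-ranked and lower-ranked documents.
    """
    scores = {}
    for result_list in result_lists:
        # Ranks are 1-based, so the top hit of each list contributes 1 / (k + 1)
        for rank, doc in enumerate(result_list, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)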

# Usage
dense_results = vector_search(query, top_k=50)
sparse_results = bm25_search(query, top_k=50)
fused = rrf_fuse([dense_results, sparse_results])
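The architecture diagram also names a weighted sum as a fusion option. Unlike RRF, a weighted sum works on raw scores, so the dense and sparse scores must first be brought onto a common scale. The sketch below assumes min-max normalization and a single `dense_weight` parameter; both are illustrative choices, and the function names are hypothetical.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a list with identical scores maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_sum_fuse(
    dense: dict[str, float],
    sparse: dict[str, float],
    dense_weight: float = 0.5,
) -> list[tuple[str, float]]:
    """Blend normalized dense and sparse scores; missing docs count as 0."""
    if not 0.0 <= dense_weight <= 1.0:
        raise ValueError("dense_weight must be between 0 and 1")
    d = min_max_normalize(dense)
    s = min_max_normalize(sparse)
    fused = {
        doc_id: dense_weight * d.get(doc_id, 0.0)
        + (1.0 - dense_weight) * s.get(doc_id, 0.0)
        for doc_id in d.keys() | s.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense_scores = {"a": 0.9, "b": 0.5, "c": 0.1}  # cosine similarities
sparse_scores = {"b": 12.0, "d": 4.0}          # BM25 scores
ranked = weighted_sum_fuse(dense_scores, sparse_scores)
```

In this example document "b" ranks first because it appears in both lists. RRF avoids the normalization step entirely by using ranks, which is one reason it is a common default when score distributions differ widely.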

Search Type Strengths

Query Type	Dense Best?	Sparse Best?	Example
Conceptual	Yes	No	"How to improve employee retention"
Exact term	No	Yes	"Error code ERR_SSL_PROTOCOL"
Acronym	No	Yes	"EBITDA calculation"
Synonym-rich	Yes	No	"laptop" finding "notebook computer"
Mixed	Both	Both	"Python memory leak fix"

Reranking in Production

Production reranking is a multi-stage funnel that balances retrieval breadth against latency constraints. Over-fetch candidates cheaply, then progressively narrow with more expensive models.

Stage	Candidates	Latency Budget	Method
Initial retrieval	50-100	< 100ms	ANN search
Rerank pass 1	20-50	< 200ms	Cross-encoder (light)
Rerank pass 2	5-10	< 500ms	LLM reranker (optional)
Final context	3-5	N/A	Top results to LLM
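The funnel in the table can be expressed as a list of (scorer, keep) stages applied in order. In this sketch the two scorers are deliberately trivial stand-ins: `overlap_score` plays the role of the light cross-encoder and `phrase_score` the role of the optional LLM reranker. All names here are hypothetical.

```python
from typing import Callable, Sequence

# A scorer maps (query, passage) to a relevance score; higher is better
Scorer = Callable[[str, str], float]


def rerank_funnel(
    query: str,
    candidates: Sequence[str],
    stages: Sequence[tuple[Scorer, int]],
) -> list[str]:
    """Apply each stage in order, keeping only the top `keep` survivors."""
    survivors = list(candidates)
    for scorer, keep in stages:
        ranked = sorted(survivors, key=lambda p: scorer(query, p), reverse=True)
        survivors = ranked[:keep]
    return survivors


def overlap_score(query: str, passage: str) -> float:
    """Cheap stand-in for a cross-encoder: count shared words."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def phrase_score(query: str, passage: str) -> float:
    """Stricter stand-in for an expensive reranker: exact phrase match."""
    return 1.0 if query.lower() in passage.lower() else 0.0


passages = [
    "reset your password in settings",
    "password policy overview",
    "how to reset password quickly",
    "office parking rules",
]
top = rerank_funnel(
    "reset password",
    passages,
    stages=[(overlap_score, 3), (phrase_score, 1)],
)
print(top)  # ['how to reset password quickly']
```

Because each stage only sees the survivors of the previous one, the expensive scorer runs on three passages here instead of four; at production sizes that ratio is what keeps the later stages inside their latency budget.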

Citation and Source Attribution

Enterprise users need to verify answers against source documents -- an answer without a citation is an answer nobody trusts. Enforce citation in your prompt template and validate that cited sources actually exist.

Citation Pattern

PROMPT_TEMPLATE = """Answer the question using ONLY the provided sources.
For every claim, cite the source using [Source N] format.
If the answer is not in the sources, say "I don't have information about that."

Sources:
{formatted_sources}

Question: {question}
Answer:"""

def format_sources(chunks):
    formatted = []
    for i, chunk in enumerate(chunks, 1):
        formatted.append(
            f"[Source {i}] ({chunk.title}, {chunk.source_type}, "
            f"updated {chunk.updated_at})\n{chunk.text}"
        )
    return "\n\n".join(formatted)

Citation Verification

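import re

def verify_citations(answer: str, sources: list[str]) -> dict:
    """Check that every [Source N] marker refers to a real source and flag uncited sentences.

    This validates citation indices only; it does not check that the cited
    text actually supports the claim.
    """
    cited = re.findall(r'\[Source (\d+)\]', answer)

    # Valid indices run from 1 to len(sources), matching format_sources
    invalid = [c for c in cited if not 1 <= int(c) <= len(sources)]
    # split_sentences is an assumed helper (e.g. a sentence tokenizer), not defined here.
    # Sentences of five words or fewer are skipped as unlikely to carry a claim.
    uncited_sentences = [s for s in split_sentences(answer)
                         if not re.search(r'\[Source \d+\]', s)
                         and len(s.split()) > 5]

    return {
        "citations_found": len(cited),
        "invalid_citations": invalid,
        "uncited_claims": uncited_sentences,
        "all_valid": len(invalid) == 0
    }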

Freshness Management

Stale data produces wrong answers -- and in enterprise contexts, outdated policy or compliance information can create real liability. Your freshness strategy must match how frequently your source data changes.

Strategy	Mechanism	Use Case
Full re-index	Rebuild entire index periodically	Small corpus (< 50K docs)
Incremental update	Add/update changed docs only	Large corpus, frequent changes
TTL-based expiry	Remove chunks older than threshold	News, time-sensitive data
Change data capture	Stream changes from source	Database-backed sources
Webhook-triggered	Re-index on source system events	Confluence, GitHub
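For the incremental update strategy, the core step is deciding which documents changed since the last run. One common approach is to compare content hashes, sketched below. The function names and the choice of SHA-256 are assumptions; a source system that exposes reliable modification timestamps or change events could replace the hashing.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(
    indexed_hashes: dict[str, str],
    current_docs: dict[str, str],
) -> dict[str, list[str]]:
    """Compare stored hashes with current source text.

    indexed_hashes maps source_id -> hash recorded at the last indexing run.
    current_docs maps source_id -> text as it exists in the source now.
    """
    current_hashes = {sid: content_hash(text) for sid, text in current_docs.items()}
    shared = current_hashes.keys() & indexed_hashes.keys()
    return {
        "add": sorted(current_hashes.keys() - indexed_hashes.keys()),
        "update": sorted(s for s in shared if current_hashes[s] != indexed_hashes[s]),
        "delete": sorted(indexed_hashes.keys() - current_hashes.keys()),
    }


indexed = {
    "doc-a": content_hash("version 1"),
    "doc-b": content_hash("unchanged"),
    "doc-c": content_hash("removed at source"),
}
current = {"doc-a": "version 2", "doc-b": "unchanged", "doc-d": "brand new"}
plan = plan_incremental_update(indexed, current)
print(plan)  # {'add': ['doc-d'], 'update': ['doc-a'], 'delete': ['doc-c']}
```

The "delete" list matters as much as the other two: chunks from documents removed at the source otherwise stay retrievable indefinitely.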

Freshness Boosting

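import math
from datetime import datetime, timezone

def apply_freshness_boost(results, decay_factor=0.1):
    """Blend relevance (80%) with an exponential recency score (20%).

    With decay_factor=0.1 the recency score halves roughly every 7 days.
    Assumes result.updated_at is a timezone-aware datetime.
    """
    now = datetime.now(timezone.utc)
    for result in results:
        # Clamp at zero so a timestamp slightly in the future is not over-boosted
        age_days = max((now - result.updated_at).days, 0)
        freshness_score = math.exp(-decay_factor * age_days)
        result.final_score = result.relevance_score * 0.8 + freshness_score * 0.2
    return sorted(results, key=lambda r: r.final_score, reverse=True)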
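To judge whether the default `decay_factor` suits a given corpus, it helps to work out how quickly the freshness term decays. Writing the decay factor as \lambda and the document age in days as t:

```latex
\begin{aligned}
f(t) &= e^{-\lambda t}, \qquad \lambda = 0.1\ \text{per day} \\
t_{1/2} &= \frac{\ln 2}{\lambda} = \frac{0.693}{0.1} \approx 6.9\ \text{days} \\
f(30) &= e^{-3} \approx 0.05 \\
\text{final\_score} &= 0.8 \cdot \text{relevance\_score} + 0.2 \cdot f(t)
\end{aligned}
```

So with the default value, a month-old document has lost about 95% of its freshness score, and freshness can contribute at most 0.2 to the final score. For slowly changing content such as policy documents, a smaller decay factor (for example 0.01, a half-life of roughly 69 days) may be a better starting point; that figure is an illustration, not a recommendation from the source.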

Data Lineage

When an answer is wrong, you need to trace it back to the exact document, chunk strategy, and embedding model that produced it. Data lineage metadata makes debugging possible instead of guesswork.

Field	Purpose
`pipeline_version`	Which pipeline version processed this
`embedding_model`	Which model created the embedding
`chunk_strategy`	How the doc was chunked
`extraction_method`	PDF parser, OCR, etc.
`processing_timestamp`	When it was processed
`source_hash`	Detect if source changed
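One practical use of these lineage fields is deciding which chunks must be reprocessed after a change. The sketch below assumes the six fields are stored per chunk; `LineageRecord`, `reindex_reasons`, and the example version and model strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    pipeline_version: str
    embedding_model: str
    chunk_strategy: str
    extraction_method: str
    processing_timestamp: datetime
    source_hash: str


def reindex_reasons(
    record: LineageRecord,
    *,
    pipeline_version: str,
    embedding_model: str,
    source_hash: str,
) -> list[str]:
    """Return why a chunk is out of date; an empty list means it is current."""
    reasons = []
    if record.source_hash != source_hash:
        reasons.append("source content changed")
    if record.embedding_model != embedding_model:
        reasons.append("embedding model changed")
    if record.pipeline_version != pipeline_version:
        reasons.append("pipeline version changed")
    return reasons


record = LineageRecord(
    pipeline_version="2.3.0",
    embedding_model="example-embed-v1",  # placeholder model name
    chunk_strategy="section-aware",
    extraction_method="pdf-parser",
    processing_timestamp=datetime(2026, 1, 20, 14, 30, tzinfo=timezone.utc),
    source_hash="abc123",
)
reasons = reindex_reasons(
    record,
    pipeline_version="2.3.0",
    embedding_model="example-embed-v2",
    source_hash="abc123",
)
print(reasons)  # ['embedding model changed']
```

Checking the embedding model matters because vectors from different models are not comparable: a query embedded with one model should not be searched against chunks embedded with another.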

Scaling Considerations

Your architecture must match your document scale -- what works for 10K documents will collapse at 10M. Plan your infrastructure tier based on where you are today and where you expect to be in 12 months.

Scale	Docs	Architecture	Notes
Small	< 10K	Single node, ChromaDB/pgvector	Simple, low cost
Medium	10K-1M	Managed vector DB + BM25 index	Pinecone, Weaviate, or Qdrant
Large	1M-100M	Distributed vector DB + Elasticsearch	Sharding, replication
Very large	100M+	Multi-tier: cache + distributed search	CDN caching, query routing

Performance Optimization

Technique	Impact	Complexity
Query caching	High (avoid repeat searches)	Low
Embedding caching	Medium (avoid re-embedding)	Low
ANN index tuning (nprobe, ef)	Medium (speed vs recall)	Medium
Quantized vectors (binary, PQ)	High (memory reduction)	Medium
Async retrieval	Medium (parallel search)	Low
Pre-computed aggregations	High for common queries	Medium
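Embedding caching from the table above can be sketched as a small wrapper around whatever embedding function the system uses. `EmbeddingCache` and `embed_fn` are hypothetical names, the in-memory dict stands in for a shared cache such as a key-value store, and collapsing whitespace before hashing is an assumption that may not suit every model.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by (model name, normalized text)."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model_name: str) -> None:
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model name so a model upgrade never serves old vectors
        normalized = " ".join(text.split())
        payload = f"{self._model_name}\x00{normalized}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

The `hits` and `misses` counters give a cache hit rate, which is a quick way to check whether the cache is earning its keep for a given query mix.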

Monitoring Metrics

Enterprise RAG systems degrade silently -- retrieval quality drops, indexes go stale, and ACL filters miss updates. These metrics are your early warning system for catching problems before users report them.

Metric	Target	Alert If
Retrieval latency (P95)	< 200ms	> 500ms
Reranking latency (P95)	< 300ms	> 800ms
End-to-end latency (P95)	< 3s	> 5s
Retrieval recall@10	> 0.8	< 0.6
Faithfulness score	> 0.9	< 0.7
Index freshness	< 1 hour	> 4 hours
ACL filter accuracy	100%	Any miss
Guardrail trigger rate	< 5%	> 15%
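Two of the metrics above, recall@k and P95 latency, are straightforward to compute from logged data. The sketch below uses the nearest-rank definition of a percentile and the alert thresholds from the table; the function names are hypothetical, and recall@k presumes a labeled evaluation set of relevant document IDs per query.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(retrieval_p95_ms: float, recall: float) -> list[str]:
    """Apply the 'Alert If' thresholds for two rows of the table."""
    alerts = []
    if retrieval_p95_ms > 500:
        alerts.append(f"retrieval P95 {retrieval_p95_ms:.0f}ms exceeds 500ms")
    if recall < 0.6:
        alerts.append(f"recall@10 {recall:.2f} below 0.6")
    return alerts


latencies_ms = [float(v) for v in range(10, 210, 10)]  # 10, 20, ..., 200
p95 = percentile(latencies_ms, 95)
recall = recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=3)
print(p95, round(recall, 2), check_alerts(p95, recall))
```

Here only the recall alert fires: the P95 latency of 190 ms is within target, but just one of the three relevant documents appears in the top 3.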

Common Pitfalls

Enterprise RAG failures tend to be more severe than basic RAG failures -- data leakage, compliance violations, and stale answers affecting business decisions. Audit your system against this list before going to production.

Pitfall	Problem	Fix
No ACL filtering	Data leakage between users	Implement ACL at retrieval time
Stale index	Wrong/outdated answers	Incremental re-indexing pipeline
Single search method	Misses keyword or semantic matches	Hybrid search (dense + sparse)
No reranking	Mediocre top-k quality	Add cross-encoder reranker
Missing citations	Users can't verify answers	Enforce citation in prompt
No monitoring	Silent quality degradation	Track retrieval + generation metrics
Flat chunking	Loses document structure	Use section-aware chunking with hierarchy
Ignoring document relationships	Missed cross-references	Add links/parent-child in metadata
No fallback for empty retrieval	"I don't know" without guidance	Add web search or escalation path