Quick Reference 08

Enterprise RAG

Quick reference for production RAG systems: ingestion, ACL filtering, hybrid search, citation, freshness, and scaling.


Enterprise RAG vs Basic RAG

A demo RAG pipeline and a production enterprise RAG system are fundamentally different in scope. This comparison shows exactly where basic RAG falls short and what you need to add for real-world deployment.

| Aspect | Basic RAG | Enterprise RAG |
| --- | --- | --- |
| Data sources | Single source | Multi-source, multi-format |
| Access control | None | Per-document ACLs |
| Search | Dense vector only | Hybrid (dense + sparse + metadata) |
| Ranking | Top-k by similarity | Reranking + business rules |
| Freshness | Static index | Incremental updates, TTL |
| Citation | None or approximate | Exact source + page/section |
| Scale | Thousands of docs | Millions of docs |
| Monitoring | None | Retrieval quality, latency, cost |
| Data lineage | None | Full provenance tracking |

Ingestion Pipeline

The ingestion pipeline is the foundation of your entire RAG system: garbage in means garbage out at query time. Getting document processing, metadata extraction, and ACL mapping right during ingestion heads off many downstream retrieval and security problems that are far harder to fix at query time.

Data Sources            Processing              Storage
┌────────────┐     ┌──────────────────┐     ┌─────────────┐
│ SharePoint │     │ Format Detection │     │ Vector DB   │
│ Confluence │────>│ Text Extraction  │────>│ (embeddings)│
│ Google Dr  │     │ Chunking         │     ├─────────────┤
│ S3 / GCS   │     │ Metadata Extract │     │ Search Index│
│ Databases  │     │ ACL Mapping      │     │ (BM25/FTS)  │
│ APIs       │     │ Embedding        │     ├─────────────┤
└────────────┘     │ Quality Filter   │     │ Metadata DB │
                   └──────────────────┘     │ (ACLs, etc) │
                                            └─────────────┘

Document Processing by Format

| Format | Extraction Tool | Notes |
| --- | --- | --- |
| PDF | PyMuPDF, Unstructured, Amazon Textract | OCR for scanned docs |
| DOCX/PPTX | python-docx, python-pptx, Unstructured | Preserve structure |
| HTML | BeautifulSoup, Trafilatura | Strip boilerplate |
| Markdown | Direct parse | Preserve headers as metadata |
| CSV/Excel | pandas | Row-level or table-level chunks |
| Images | OCR (Tesseract, Textract) + VLM | Vision model for diagrams |
| Audio/Video | Whisper transcription | Timestamps as metadata |
| Confluence/Wiki | API extraction | Page hierarchy as context |
| Code | AST parsing | Function/class level |
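
The table above maps naturally onto a dispatch step that routes each file to an extraction path. The following is a minimal sketch, assuming routing by file extension only; `EXTRACTORS` and `pick_extractor` are hypothetical names, and the values are route labels rather than calls into the libraries listed. A production pipeline would likely also sniff content types, since extensions can be wrong.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an extraction route.
# The values are labels for the routes in the table, not library calls.
EXTRACTORS = {
    ".pdf": "pdf",
    ".docx": "office",
    ".pptx": "office",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".csv": "tabular",
    ".xlsx": "tabular",
    ".png": "image_ocr",
    ".jpg": "image_ocr",
    ".mp3": "transcription",
    ".mp4": "transcription",
    ".py": "code",
}


def pick_extractor(path: str, default: str = "unsupported") -> str:
    """Return the extraction route for a file, based on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    return EXTRACTORS.get(suffix, default)
```

Files that fall through to `"unsupported"` are worth logging rather than silently dropping, so gaps in format coverage stay visible.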

Metadata to Extract and Store

| Field | Purpose | Example |
| --- | --- | --- |
| source_id | Unique document identifier | confluence://page/12345 |
| title | Display and search | "Q3 Revenue Report" |
| source_type | Format/origin | pdf, confluence, slack |
| created_at | Freshness sorting | 2025-11-15T10:00:00Z |
| updated_at | Freshness, re-indexing | 2026-01-20T14:30:00Z |
| author | Attribution | "jane.doe@company.com" |
| acl_groups | Access control | ["engineering", "all-staff"] |
| acl_users | Access control | ["user123"] |
| department | Filtering | "Engineering" |
| chunk_index | Ordering within doc | 3 (of 12) |
| parent_id | Parent-child linking | doc_abc_chunk_0 |
| confidence | Extraction quality | 0.95 |
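
One way to keep these fields consistent across the pipeline is to define them once as a typed record. The sketch below is an illustration, not a required schema: the class name `ChunkMetadata`, the defaults, and the `to_record` helper are assumptions, while the field names and example values come from the table above.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChunkMetadata:
    """Per-chunk metadata stored alongside the embedding (hypothetical schema)."""
    source_id: str
    title: str
    source_type: str
    created_at: datetime
    updated_at: datetime
    author: str
    acl_groups: list[str] = field(default_factory=list)
    acl_users: list[str] = field(default_factory=list)
    department: Optional[str] = None
    chunk_index: int = 0
    parent_id: Optional[str] = None
    confidence: float = 1.0

    def to_record(self) -> dict:
        """Flatten to a JSON-friendly dict with ISO 8601 timestamps."""
        record = asdict(self)
        record["created_at"] = self.created_at.isoformat()
        record["updated_at"] = self.updated_at.isoformat()
        return record


meta = ChunkMetadata(
    source_id="confluence://page/12345",
    title="Q3 Revenue Report",
    source_type="confluence",
    created_at=datetime(2025, 11, 15, 10, 0, tzinfo=timezone.utc),
    updated_at=datetime(2026, 1, 20, 14, 30, tzinfo=timezone.utc),
    author="jane.doe@company.com",
    acl_groups=["engineering", "all-staff"],
    chunk_index=3,
    parent_id="doc_abc_chunk_0",
    confidence=0.95,
)
```

Defaulting `acl_groups` and `acl_users` to empty lists means a chunk with no mapped ACLs matches no one under an allow-list filter, which fails closed rather than open.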

Access Control (ACL) Filtering

In enterprise environments, returning a document the user should not see is a security incident, not a UX bug. ACL filtering must be enforced at the retrieval layer, not as a post-processing step.

Pattern: Pre-filter at Query Time

def search_with_acl(query: str, user: User, top_k: int = 10):
    # Get user's access groups
    user_groups = get_user_groups(user.id)

    # Build metadata filter
    acl_filter = {
        "$or": [
            {"acl_groups": {"$in": user_groups}},
            {"acl_users": {"$in": [user.id]}},
            {"acl_groups": {"$in": ["public"]}}
        ]
    }

    # Vector search with filter
    results = vector_db.query(
        embedding=embed(query),
        filter=acl_filter,
        top_k=top_k
    )
    return results

ACL Sync Strategy

| Strategy | Description | Freshness | Complexity |
| --- | --- | --- | --- |
| Ingest-time ACL | Copy ACLs at indexing | Stale until re-index | Low |
| Query-time ACL check | Verify against source at query | Real-time | High latency |
| ACL cache + sync | Cache ACLs, sync periodically | Minutes-stale | Medium |
| Event-driven ACL | Update on permission change events | Near real-time | Medium-high |

Recommendation: ACL cache with 5-15 minute sync interval for most enterprise use cases.
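
The cached approach can be sketched as a small TTL cache in front of the group lookup. This is a minimal illustration that refreshes lazily on read rather than through a background sync job; `AclCache`, `fetch_groups`, and the injectable `clock` are hypothetical names, and the 600-second default is simply one value inside the 5-15 minute range.

```python
import time
from typing import Callable


class AclCache:
    """Cache user -> groups lookups and refresh them once a TTL has elapsed."""

    def __init__(self, fetch_groups: Callable[[str], list[str]],
                 ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_groups  # hypothetical call to the identity provider
        self._ttl = ttl_seconds     # 600s = 10 minutes
        self._clock = clock         # injectable so expiry can be tested
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def get_groups(self, user_id: str) -> list[str]:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                      # still fresh: serve from cache
        groups = self._fetch(user_id)            # missing or expired: refetch
        self._entries[user_id] = (now, groups)
        return groups

    def invalidate(self, user_id: str) -> None:
        """Drop a cached entry, e.g. when a permission-change event arrives."""
        self._entries.pop(user_id, None)
```

Calling `invalidate` from a permission-change handler combines this with the event-driven strategy: the TTL bounds worst-case staleness, while events shorten it for the users who actually changed.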

Hybrid Search

Enterprise queries range from exact error codes to broad conceptual questions -- no single search method handles both well. Hybrid search combines dense and sparse retrieval with metadata filtering to cover the full spectrum.

Architecture

Query
  │
  ├──> Dense Search (embedding similarity)  ──> Scores A
  │
  ├──> Sparse Search (BM25 / keyword)       ──> Scores B
  │
  └──> Metadata Filter (date, source, ACL)   ──> Scores C

  Fusion (RRF or weighted sum) ──> Reranker ──> Top K results

Reciprocal Rank Fusion

def rrf_fuse(result_lists: list[list], k: int = 60) -> list:
    """Fuse multiple ranked lists using Reciprocal Rank Fusion."""
    scores = {}
    for result_list in result_lists:
        for rank, doc in enumerate(result_list):
            scores[doc.id] = scores.get(doc.id, 0) + 1.0 / (k + rank + 1)

    sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_docs

# Usage
dense_results = vector_search(query, top_k=50)
sparse_results = bm25_search(query, top_k=50)
fused = rrf_fuse([dense_results, sparse_results])
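
The architecture diagram also names weighted sum as a fusion option. Unlike RRF, it works on raw scores, so the dense and sparse scores need to be brought onto a common scale first. The sketch below assumes min-max normalization and a single weight `alpha`; both choices, and the function names, are illustrative rather than prescribed by the pipeline above.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every doc gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fuse(dense: dict[str, float], sparse: dict[str, float],
                  alpha: float = 0.5) -> list[tuple[str, float]]:
    """Combine normalized dense and sparse scores; alpha weights the dense side."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {
        # A doc missing from one list contributes 0 from that side.
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

RRF needs no score calibration because it uses ranks only, which is why it is often the easier default; a weighted sum gives more control when one retriever is known to be stronger for a given query mix.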

Search Type Strengths

| Query Type | Dense Best? | Sparse Best? | Example |
| --- | --- | --- | --- |
| Conceptual | Yes | No | "How to improve employee retention" |
| Exact term | No | Yes | "Error code ERR_SSL_PROTOCOL" |
| Acronym | No | Yes | "EBITDA calculation" |
| Synonym-rich | Yes | No | "laptop" finding "notebook computer" |
| Mixed | Both | Both | "Python memory leak fix" |

Reranking in Production

Production reranking is a multi-stage funnel that balances retrieval breadth against latency constraints. Over-fetch candidates cheaply, then progressively narrow with more expensive models.

| Stage | Candidates | Latency Budget | Method |
| --- | --- | --- | --- |
| Initial retrieval | 50-100 | < 100ms | ANN search |
| Rerank pass 1 | 20-50 | < 200ms | Cross-encoder (light) |
| Rerank pass 2 | 5-10 | < 500ms | LLM reranker (optional) |
| Final context | 3-5 | N/A | Top results to LLM |
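
The funnel above can be expressed as a list of (scorer, keep_n) stages applied in order. This is a sketch of the control flow only: `rerank_funnel` is a hypothetical name, and the two scorers are toy stand-ins for a cross-encoder and an LLM reranker, included so the example runs without model dependencies.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, doc_text) -> relevance score


def rerank_funnel(query: str, candidates: Sequence[str],
                  stages: Sequence[tuple[Scorer, int]]) -> list[str]:
    """Apply each (scorer, keep_n) stage in order, narrowing the candidates."""
    current = list(candidates)
    for scorer, keep_n in stages:
        ranked = sorted(current, key=lambda doc: scorer(query, doc), reverse=True)
        current = ranked[:keep_n]  # only survivors reach the costlier next stage
    return current


# Toy scorers for illustration only; real stages would wrap model calls.
def term_overlap(query: str, doc: str) -> float:
    q_terms = set(query.lower().split())
    return len(q_terms & set(doc.lower().split())) / max(len(q_terms), 1)


def brevity_adjusted(query: str, doc: str) -> float:
    return term_overlap(query, doc) - 0.001 * len(doc.split())
```

Because each stage only sees the survivors of the previous one, the expensive scorer runs on a handful of documents, which is what keeps the later latency budgets achievable.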

Citation and Source Attribution

Enterprise users need to verify answers against source documents -- an answer without a citation is an answer nobody trusts. Enforce citation in your prompt template and validate that cited sources actually exist.

Citation Pattern

PROMPT_TEMPLATE = """Answer the question using ONLY the provided sources.
For every claim, cite the source using [Source N] format.
If the answer is not in the sources, say "I don't have information about that."

Sources:
{formatted_sources}

Question: {question}
Answer:"""

def format_sources(chunks):
    formatted = []
    for i, chunk in enumerate(chunks, 1):
        formatted.append(
            f"[Source {i}] ({chunk.title}, {chunk.source_type}, "
            f"updated {chunk.updated_at})\n{chunk.text}"
        )
    return "\n\n".join(formatted)

Citation Verification

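import re

def verify_citations(answer: str, sources: list[str]) -> dict:
    """Check that cited source numbers exist and flag sentences with no citation.

    This validates citation format only; it does not check that a cited
    source actually supports the claim. `split_sentences` is assumed to be
    a sentence-splitting helper defined elsewhere.
    """
    cited = re.findall(r'\[Source (\d+)\]', answer)
    valid_range = set(str(i) for i in range(1, len(sources) + 1))

    # Citations pointing at a source number that was never provided
    invalid = [c for c in cited if c not in valid_range]
    # Sentences longer than five words that carry no citation at all
    uncited_sentences = [s for s in split_sentences(answer)
                         if not re.search(r'\[Source \d+\]', s)
                         and len(s.split()) > 5]

    return {
        "citations_found": len(cited),
        "invalid_citations": invalid,
        "uncited_claims": uncited_sentences,
        "all_valid": len(invalid) == 0
    }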

Freshness Management

Stale data produces wrong answers -- and in enterprise contexts, outdated policy or compliance information can create real liability. Your freshness strategy must match how frequently your source data changes.

| Strategy | Mechanism | Use Case |
| --- | --- | --- |
| Full re-index | Rebuild entire index periodically | Small corpus (< 50K docs) |
| Incremental update | Add/update changed docs only | Large corpus, frequent changes |
| TTL-based expiry | Remove chunks older than threshold | News, time-sensitive data |
| Change data capture | Stream changes from source | Database-backed sources |
| Webhook-triggered | Re-index on source system events | Confluence, GitHub |
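
For the incremental update strategy, the core step is deciding which documents changed since the last run. One common approach, sketched here as an assumption rather than the only option, is to compare a content hash of each source document against the hash stored at indexing time. The function names and the dict-based inputs are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the document text, used as a cheap change detector."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(source_docs: dict[str, str],
                            indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Decide which documents to add, update, or delete.

    source_docs maps source_id -> current text; indexed_hashes maps
    source_id -> the hash recorded at the last indexing run.
    """
    to_add, to_update = [], []
    for source_id, text in source_docs.items():
        if source_id not in indexed_hashes:
            to_add.append(source_id)        # never indexed before
        elif indexed_hashes[source_id] != content_hash(text):
            to_update.append(source_id)     # content changed since last run
    # Anything indexed but no longer present at the source should be removed.
    to_delete = [sid for sid in indexed_hashes if sid not in source_docs]
    return {"add": sorted(to_add), "update": sorted(to_update),
            "delete": sorted(to_delete)}
```

Handling deletions matters as much as updates: a document removed at the source but still present in the index keeps producing answers from content that no longer exists.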

Freshness Boosting

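import math
from datetime import datetime, timezone

def apply_freshness_boost(results, decay_factor=0.1):
    """Blend relevance with an exponential freshness decay, then re-sort.

    Assumes each result has a timezone-aware `updated_at` and a
    `relevance_score` already normalized to [0, 1].
    """
    now = datetime.now(timezone.utc)  # utcnow() is naive and deprecated in Python 3.12+
    for result in results:
        age_days = max((now - result.updated_at).days, 0)  # clamp future timestamps
        freshness_score = math.exp(-decay_factor * age_days)
        result.final_score = result.relevance_score * 0.8 + freshness_score * 0.2
    return sorted(results, key=lambda r: r.final_score, reverse=True)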

Data Lineage

When an answer is wrong, you need to trace it back to the exact document, chunk strategy, and embedding model that produced it. Data lineage metadata makes debugging possible instead of guesswork.

| Field | Purpose |
| --- | --- |
| pipeline_version | Which pipeline version processed this |
| embedding_model | Which model created the embedding |
| chunk_strategy | How the doc was chunked |
| extraction_method | PDF parser, OCR, etc. |
| processing_timestamp | When it was processed |
| source_hash | Detect if source changed |
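
A lineage record is also what lets you find chunks produced by an outdated pipeline or embedding model. The sketch below uses the field names from the table; the `Lineage` class, the `needs_reprocessing` helper, and the example values such as `"embed-v1"` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Lineage:
    """Provenance stored with every chunk (hypothetical record type)."""
    pipeline_version: str
    embedding_model: str
    chunk_strategy: str
    extraction_method: str
    processing_timestamp: datetime
    source_hash: str


def needs_reprocessing(lineage: Lineage, current_pipeline: str,
                       current_model: str) -> bool:
    """True if the chunk was built by an older pipeline or embedding model."""
    return (lineage.pipeline_version != current_pipeline
            or lineage.embedding_model != current_model)


record = Lineage(
    pipeline_version="1.4.0",
    embedding_model="embed-v1",          # placeholder model name
    chunk_strategy="section-aware",
    extraction_method="pdf-parser",
    processing_timestamp=datetime(2026, 1, 20, 14, 30, tzinfo=timezone.utc),
    source_hash="0" * 64,                # placeholder hash value
)
```

Checking `embedding_model` is the important case: vectors from different embedding models are generally not comparable, so a model upgrade usually implies re-embedding the affected chunks rather than mixing both in one index.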

Scaling Considerations

Your architecture must match your document scale -- what works for 10K documents will collapse at 10M. Plan your infrastructure tier based on where you are today and where you expect to be in 12 months.

| Scale | Docs | Architecture | Notes |
| --- | --- | --- | --- |
| Small | < 10K | Single node, ChromaDB/pgvector | Simple, low cost |
| Medium | 10K-1M | Managed vector DB + BM25 index | Pinecone, Weaviate, or Qdrant |
| Large | 1M-100M | Distributed vector DB + Elasticsearch | Sharding, replication |
| Very large | 100M+ | Multi-tier: cache + distributed search | CDN caching, query routing |

Performance Optimization

| Technique | Impact | Complexity |
| --- | --- | --- |
| Query caching | High (avoid repeat searches) | Low |
| Embedding caching | Medium (avoid re-embedding) | Low |
| ANN index tuning (nprobe, ef) | Medium (speed vs recall) | Medium |
| Quantized vectors (binary, PQ) | High (memory reduction) | Medium |
| Async retrieval | Medium (parallel search) | Low |
| Pre-computed aggregations | High for common queries | Medium |
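
Query caching interacts with access control: a cached result set is only safe to reuse for a caller with the same permissions. The sketch below, with hypothetical names throughout, keys the cache on both the normalized query and an `access_scope` list. That list is assumed to contain everything the ACL filter depends on, such as group names and any per-user grants.

```python
import hashlib
import time
from typing import Callable, Optional


class QueryCache:
    """TTL cache for search results, scoped by the caller's access rights."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: dict[str, tuple[float, list]] = {}

    @staticmethod
    def make_key(query: str, access_scope: list[str]) -> str:
        normalized = " ".join(query.lower().split())   # case/whitespace-insensitive
        scope = ",".join(sorted(set(access_scope)))    # order-independent scope
        return hashlib.sha256(f"{scope}|{normalized}".encode("utf-8")).hexdigest()

    def get(self, query: str, access_scope: list[str]) -> Optional[list]:
        entry = self._store.get(self.make_key(query, access_scope))
        if entry is None or self._clock() - entry[0] >= self._ttl:
            return None  # miss or expired
        return entry[1]

    def put(self, query: str, access_scope: list[str], results: list) -> None:
        self._store[self.make_key(query, access_scope)] = (self._clock(), results)
```

Keeping the cache TTL no longer than the ACL sync interval avoids the cache extending how long a revoked permission stays effective.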

Monitoring Metrics

Enterprise RAG systems degrade silently -- retrieval quality drops, indexes go stale, and ACL filters miss updates. These metrics are your early warning system for catching problems before users report them.

| Metric | Target | Alert If |
| --- | --- | --- |
| Retrieval latency (P95) | < 200ms | > 500ms |
| Reranking latency (P95) | < 300ms | > 800ms |
| End-to-end latency (P95) | < 3s | > 5s |
| Retrieval recall@10 | > 0.8 | < 0.6 |
| Faithfulness score | > 0.9 | < 0.7 |
| Index freshness | < 1 hour | > 4 hours |
| ACL filter accuracy | 100% | Any miss |
| Guardrail trigger rate | < 5% | > 15% |
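
The alert thresholds above translate directly into a rule table that a monitoring job can evaluate. This sketch covers a subset of the metrics; the metric keys such as `retrieval_latency_p95_ms` are hypothetical names, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# (metric key, alert threshold, direction), using the "Alert If" column above.
ALERT_RULES = [
    ("retrieval_latency_p95_ms", 500, "above"),
    ("reranking_latency_p95_ms", 800, "above"),
    ("end_to_end_latency_p95_ms", 5000, "above"),
    ("recall_at_10", 0.6, "below"),
    ("faithfulness", 0.7, "below"),
    ("index_age_hours", 4, "above"),
]


def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the metric keys whose current value breaches its threshold."""
    alerts = []
    for name, threshold, direction in ALERT_RULES:
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this window
        if direction == "above" and value > threshold:
            alerts.append(name)
        elif direction == "below" and value < threshold:
            alerts.append(name)
    return alerts
```

A metric that stops being reported is skipped here; in practice a missing metric is itself worth alerting on, since silent gaps are how degradation goes unnoticed.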

Common Pitfalls

Enterprise RAG failures tend to be more severe than basic RAG failures -- data leakage, compliance violations, and stale answers affecting business decisions. Audit your system against this list before going to production.

| Pitfall | Problem | Fix |
| --- | --- | --- |
| No ACL filtering | Data leakage between users | Implement ACL at retrieval time |
| Stale index | Wrong/outdated answers | Incremental re-indexing pipeline |
| Single search method | Misses keyword or semantic matches | Hybrid search (dense + sparse) |
| No reranking | Mediocre top-k quality | Add cross-encoder reranker |
| Missing citations | Users can't verify answers | Enforce citation in prompt |
| No monitoring | Silent quality degradation | Track retrieval + generation metrics |
| Flat chunking | Loses document structure | Use section-aware chunking with hierarchy |
| Ignoring document relationships | Missed cross-references | Add links/parent-child in metadata |
| No fallback for empty retrieval | "I don't know" without guidance | Add web search or escalation path |