RAG retrieval
RAG quality depends on three factors: chunking granularity, retrieval strategy, and citation binding. This page provides a comparison of three chunking strategies, implementation code for BM25 + vector hybrid retrieval, citation format conventions, and an analysis of the common failure mode "retrieved but not used".
Each step in the RAG pipeline "chunk → embed → retrieve → rerank → assemble → generate" has its own independent failure modes. When debugging: first verify whether recall contains the correct answer chunk; then check whether reranking promotes it into top-k; only then tune the generation prompt.
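The debugging order above can be sketched as a small triage helper. This is a minimal illustration, not part of any library: the function name `locate_failure` and the chunk ids are hypothetical, and it assumes you already have a labeled "gold" chunk id for the query.

```python
from typing import Sequence


def locate_failure(gold_id: str,
                   recalled: Sequence[str],
                   reranked: Sequence[str],
                   k: int) -> str:
    """Return the first pipeline stage that loses the gold chunk."""
    if gold_id not in recalled:
        return "recall"       # fix chunking, embeddings, or retrieval first
    if gold_id not in reranked[:k]:
        return "rerank"       # recalled, but not promoted into top-k
    return "generation"       # evidence reached the prompt; tune the prompt last


# Hypothetical ids: the gold chunk "c7" is recalled but ranked third with k=2.
stage = locate_failure("c7",
                       recalled=["c1", "c7", "c3"],
                       reranked=["c1", "c3", "c7"],
                       k=2)
print(stage)  # rerank
```

Running this over a labeled query set and counting the returned stages gives a rough picture of where to spend tuning effort.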
RAG pipeline (conceptual)
The ASCII diagram below shows the conceptual ordering; implementations may insert "query rewrite" or "permission filter" at different points, but metadata and citation ids must stay consistent end to end.
[ Raw docs / API / crawl ]
│
▼
[ Clean: dedupe / version / license tags ]
│
▼
[ Chunk + metadata: title path / URI / paragraph anchors ]
│
▼
[ Embed into vector index ] ◄──── [ Sparse index: BM25, etc. ]
│
▼
┌────────────────────────────────────────────────┐
│ Query side: rewrite (opt) → hybrid top-N       │
│             → cross-encoder rerank top-k       │
│             → assemble in budget + cite table  │
└────────────────────────────────────────────────┘
│
▼
[ LLM: answer from chunks only / refuse / emit citations ]
- Tables, code blocks, and lists often deserve separate chunking rules to avoid semantic breaks when mixed with prose.
- Embedding model language should match the corpus; for cross-lingual cases label explicitly or split indexes.
Chunking strategy comparison — when to use each
Three mainstream chunking strategies each have their appropriate use cases; parameter choice directly affects retrieval hit rate:
┌──────────────────┬────────────────┬──────────────────────────┬────────────────────────────┐
│ Strategy         │ chunk_size     │ Suitable for             │ Main risk                  │
├──────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Fixed-size       │ 512-1024 tokens│ Plain text, logs,        │ Truncates sentences/paras; │
│                  │ overlap: 10%   │ unstructured docs        │ semantic breaks            │
├──────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Semantic         │ Adaptive       │ Articles, papers,        │ Uneven chunk sizes; large  │
│ (NLTK/spaCy)     │ avg 300-600 tok│ paragraph-structured docs│ chunks consume more tokens │
├──────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Recursive        │ Split on \n\n  │ Markdown/HTML/codebases, │ Requires maintaining a     │
│ (LangChain       │ then \n, space │ heading-structured docs  │ separator priority list    │
│ RecursiveTC)     │ max 1024 tokens│                          │                            │
└──────────────────┴────────────────┴──────────────────────────┴────────────────────────────┘
# Recommended parameters (Python / LangChain):
# Newer LangChain releases also ship this class in the langchain-text-splitters
# package (from langchain_text_splitters import RecursiveCharacterTextSplitter).
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # characters; ~500-800 for CJK docs, ~800-1200 for English
    chunk_overlap=80,     # overlap = chunk_size × 10%; reduces boundary truncation
    separators=["\n\n", "\n", ".", " ", ""],  # priority from high to low
    length_function=len,  # counts characters; switch to tiktoken for exact token counts
)
# Required metadata fields:
{
    "source": "docs/install-guide.md",        # document path
    "section": "§3.2 Environment Variables",  # heading path (H1 > H2 > H3)
    "chunk_index": 7,                         # chunk order within the document
    "updated_at": "2024-03-01",               # document update time; used for stale filtering
    "doc_version": "v2.4.1"                   # document version
}
- Overlap calculation: with chunk_size=800, overlap=80 (10%) is the empirical minimum. Increase to 15% (120 chars) for prose-heavy CJK documents since each sentence carries higher information density.
- Tables/code blocks: treat each as a single chunk with no internal splitting; for overlong ones, prepend the comment "# (truncated, full at {uri})" before truncating.
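To make the overlap arithmetic concrete, here is a minimal sketch of a fixed-size character chunker that attaches per-chunk metadata. It is a plain-Python illustration rather than the LangChain splitter: the function name `chunk_with_overlap` and the `char_start` field are assumptions added for the example, and it ignores separators entirely.

```python
from typing import Dict, List


def chunk_with_overlap(text: str, source: str, chunk_size: int = 800,
                       overlap_ratio: float = 0.10) -> List[Dict]:
    """Split text into fixed-size character windows that overlap by overlap_ratio."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("need chunk_size > 0 and 0 <= overlap_ratio < 1")
    overlap = round(chunk_size * overlap_ratio)  # 800 * 0.10 -> 80 characters
    step = chunk_size - overlap                  # window advances 720 characters
    chunks: List[Dict] = []
    start = 0
    while start < len(text):
        chunks.append({
            "text": text[start:start + chunk_size],
            "source": source,              # document path
            "chunk_index": len(chunks),    # chunk order within the document
            "char_start": start,           # hypothetical extra field for anchors
        })
        if start + chunk_size >= len(text):
            break                          # this window already reaches the end
        start += step
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2000))
chunks = chunk_with_overlap(sample, source="docs/install-guide.md")
print(len(chunks), [c["char_start"] for c in chunks])  # 3 [0, 720, 1440]
```

With the default 10% overlap, the last 80 characters of one chunk reappear as the first 80 characters of the next; passing `overlap_ratio=0.15` yields the 120-character overlap mentioned above.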
Retrieval and reranking
Pure vector search can miss exact terms and proper nouns; sparse keyword search can be noisy. Common pattern: take candidates from both paths, fuse scores, then rerank top-N with a small cross-encoder and pass only top-k to generation.
- Query rewrite: multi-query expansion or HyDE when appropriate; document latency, cost, and drift in the SKILL.
- Fusion: RRF, weighted linear, or learned rank; log per-path contribution for tuning.
- Reranking: fix cross-encoder batch and truncation length so online matches offline eval.
- Empty results: define fallback (broaden recall, switch index, ask the user to narrow) and monitor "zero-hit rate".
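As one possible shape for the fusion step, the sketch below applies Reciprocal Rank Fusion (RRF) to two already-ranked candidate lists and keeps the per-path contribution for logging. It assumes each path returns chunk ids ordered best-first with no duplicates; the function name `rrf_fuse`, the path labels, and the chunk ids are hypothetical, and `k=60` is the conventional RRF constant rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: Dict[str, List[str]], k: int = 60,
             top_n: int = 10) -> List[Tuple[str, float, Dict[str, float]]]:
    """Reciprocal Rank Fusion: score(d) = sum over paths of 1 / (k + rank)."""
    totals: Dict[str, float] = defaultdict(float)
    parts: Dict[str, Dict[str, float]] = defaultdict(dict)
    for path, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            contribution = 1.0 / (k + rank)
            totals[doc_id] += contribution
            parts[doc_id][path] = contribution  # per-path share, for tuning logs
    # Highest fused score first; ties broken by id for deterministic output.
    order = sorted(totals, key=lambda d: (-totals[d], d))
    return [(d, totals[d], parts[d]) for d in order[:top_n]]


fused = rrf_fuse({
    "bm25":   ["c3", "c1", "c9"],   # sparse path, best first
    "vector": ["c1", "c4", "c3"],   # dense path, best first
})
for doc_id, score, by_path in fused:
    print(doc_id, round(score, 5), by_path)
```

The fused list would then go to the cross-encoder as the top-N candidates. An empty return value is the signal to trigger whichever fallback you defined and to increment the zero-hit counter.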
Citations and generation constraints
Instruct the model to answer only from retrieved chunks; bind each claim to a citation id aligned with vector metadata. When evidence is insufficient, emit structured refusal instead of fabrication.
- Citation format: e.g. [doc_id § section] or [^chunk_uuid], matching clickable anchors in the UI.
- Conflicting chunks: if retrieved results contradict each other, require the model to list contradictions and lower certainty.
- SKILL checklist: index refresh cadence, permission filters, PII redaction rules, whether summaries may leave the org under license.
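Citation binding can be checked after generation. The sketch below assumes the [^chunk_uuid] format and that each marker appears before the sentence's closing punctuation; the function names, the naive sentence splitter, and the refusal schema (`status`, `reason`) are all hypothetical choices for illustration, not a standard.

```python
import re
from typing import Dict, List, Set

CITATION = re.compile(r"\[\^([A-Za-z0-9_-]+)\]")  # matches [^chunk_uuid]


def check_citations(answer: str, retrieved_ids: Set[str]) -> Dict[str, List[str]]:
    """Find citation ids that were never retrieved, and sentences with no citation."""
    # Naive split on ., !, ? followed by whitespace; good enough for a sketch.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    unknown: List[str] = []
    uncited: List[str] = []
    for sentence in sentences:
        ids = CITATION.findall(sentence)
        if not ids:
            uncited.append(sentence)
        unknown.extend(i for i in ids if i not in retrieved_ids)
    return {"unknown_ids": unknown, "uncited_sentences": uncited}


def gate(answer: str, retrieved_ids: Set[str]) -> Dict:
    """Pass the answer through, or replace it with a structured refusal."""
    report = check_citations(answer, retrieved_ids)
    if report["unknown_ids"] or report["uncited_sentences"]:
        return {"status": "refused", "reason": "insufficient_evidence", **report}
    return {"status": "ok", "answer": answer}


answer = "Set PORT to 8080 [^a1]. Restart the service [^zz]. It usually takes a minute."
print(gate(answer, retrieved_ids={"a1", "b2"}))
```

Whether an uncited sentence should force a refusal or only a warning is a policy decision; the strict behavior here is just one option.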
Chunk character estimator
The estimator runs in the front end only. It uses common heuristics to convert a target token count per chunk into a rough character cap, then applies a naive fixed-character sliding window to count chunks on sample text. No real tokenizer is involved, so treat the result as an initial configuration only.
Token-to-character ratios used: roughly 1∶3.5–4 for English and 1∶1.2–1.8 for CJK-heavy text. These are heuristics; trust your production tokenizer for final values.
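The same estimate can be reproduced offline. This is a minimal sketch using only the heuristic ratios above; the function names and the choice to take the low end of each range (so chunks err on the short side) are assumptions for illustration.

```python
import math

# Characters per token, from the heuristic ranges above (not a real tokenizer).
CHARS_PER_TOKEN = {"english": (3.5, 4.0), "cjk": (1.2, 1.8)}


def char_cap(target_tokens: int, profile: str = "english") -> int:
    """Conservative character cap: the low end of the range keeps chunks under budget."""
    low, _high = CHARS_PER_TOKEN[profile]
    return int(target_tokens * low)


def count_chunks(text_len: int, chunk_chars: int, overlap_chars: int = 0) -> int:
    """Number of fixed-character sliding windows needed to cover text_len characters."""
    if chunk_chars <= overlap_chars:
        raise ValueError("chunk_chars must exceed overlap_chars")
    if text_len <= 0:
        return 0
    if text_len <= chunk_chars:
        return 1
    step = chunk_chars - overlap_chars
    return 1 + math.ceil((text_len - chunk_chars) / step)


print(char_cap(512))                 # 1792 characters for English
print(char_cap(512, "cjk"))          # 614 characters for CJK-heavy text
print(count_chunks(2000, 800, 80))   # 3 windows
```

Once a real tokenizer is available, replace `char_cap` with an exact token count and keep `count_chunks` only as a sanity check.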
Evaluation and operations
Offline: hit rate, MRR, nDCG, human relevance spot checks; online: zero-hit rate, refusal rate, citation click-through, user corrections. Guard against silent index drift (e.g. embedding model swap without full re-embed).
- Query rewrite: log recall delta before/after rewrite to catch "rewrite amplifies drift".
- Versioning: index schema, embedding model, reranker versions in release notes and rollback plans.
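The offline metrics named above can be computed with a few lines of plain Python. This sketch assumes binary relevance labels (a chunk is either relevant or not) and a non-empty query set; the function names and sample ids are illustrative.

```python
import math
from typing import List, Sequence, Set


def hit_rate_at_k(results: Sequence[List[str]], gold: Sequence[Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for ranked, rel in zip(results, gold) if rel & set(ranked[:k]))
    return hits / len(results)


def mrr(results: Sequence[List[str]], gold: Sequence[Set[str]]) -> float:
    """Mean reciprocal rank of the first relevant chunk (0 when none is found)."""
    total = 0.0
    for ranked, rel in zip(results, gold):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k for a single query."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


results = [["c1", "c7", "c3"], ["c2", "c5", "c8"]]  # ranked ids per query
gold = [{"c7"}, {"c9"}]                             # relevant ids per query
print(hit_rate_at_k(results, gold, k=2))            # 0.5
print(mrr(results, gold))                           # 0.25
print(round(ndcg_at_k(results[0], gold[0], k=3), 4))
```

Recording these numbers alongside the index schema, embedding model, and reranker versions makes a silent index drift show up as a metric regression tied to a specific release.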
---
name: rag-retrieval-pipeline
description: Design or review RAG retrieval and citation flow
---
# Steps
1. Chunking, metadata, index updates
2. Retrieval: vector / keyword / rerank
3. Generation: citation format, refusal, evaluation