RAG retrieval

RAG quality depends on three factors: chunking granularity, retrieval strategy, and citation binding. This page compares three chunking strategies, gives implementation code for BM25 + vector hybrid retrieval, sets out citation format conventions, and analyzes the common failure mode "retrieved but not used".

Each step in the RAG pipeline "chunk → embed → retrieve → rerank → assemble → generate" has its own independent failure modes. When debugging: first verify whether recall contains the correct answer chunk; then check whether reranking promotes it into top-k; only then tune the generation prompt.

RAG pipeline (conceptual)

The ASCII diagram below shows the conceptual ordering. Implementations may insert "query rewrite" or "permission filter" at different points, but metadata and citation ids must stay consistent end to end.

  [ Raw docs / API / crawl ]
              │
              ▼
        [ Clean: dedupe / version / license tags ]
              │
              ▼
        [ Chunk + metadata: title path / URI / paragraph anchors ]
              │
              ▼
        [ Embed into vector index ] ◄──── [ Sparse index: BM25, etc. ]
              │
              ▼
  ┌─────────────────────────────────────────────┐
  │  Query side: rewrite (opt) → hybrid top-N   │
  │           → cross-encoder rerank top-k      │
  │           → assemble in budget + cite table │
  └─────────────────────────────────────────────┘
              │
              ▼
        [ LLM: answer from chunks only / refuse / emit citations ]

Tables, code blocks, and lists often deserve separate chunking rules to avoid semantic breaks when mixed with prose.
The embedding model's language coverage should match the corpus; for cross-lingual corpora, label each chunk's language explicitly or split into per-language indexes.

Chunking strategy comparison — when to use each

Three mainstream chunking strategies each have their appropriate use cases; parameter choice directly affects retrieval hit rate:

┌───────────────────┬────────────────┬──────────────────────────┬────────────────────────────┐
│ Strategy          │ chunk_size     │ Suitable for             │ Main risk                  │
├───────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Fixed-size        │ 512-1024 tokens│ Plain text, logs,        │ Truncates sentences/paras; │
│                   │ overlap: 10%   │ unstructured docs        │ semantic breaks            │
├───────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Semantic          │ Adaptive       │ Articles, papers,        │ Uneven chunk sizes; large  │
│ (NLTK/spaCy)      │ avg 300-600 tok│ paragraph-structured docs│ chunks consume more tokens │
├───────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Recursive         │ Split on \n\n  │ Markdown/HTML/codebases, │ Requires maintaining a     │
│ (LangChain        │ then \n, space │ heading-structured docs  │ separator priority list    │
│  RecursiveTC)     │ max 1024 tokens│                          │                            │
└───────────────────┴────────────────┴──────────────────────────┴────────────────────────────┘

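# Recommended parameters (Python / LangChain):
# Recent LangChain releases ship the splitter in the langchain-text-splitters package;
# older releases expose the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter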

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # characters; ~500-800 for CJK docs, ~800-1200 for English
    chunk_overlap=80,      # overlap = chunk_size × 10%; reduces boundary truncation
    separators=["\n\n", "\n", ".", " ", ""],  # priority from high to low
    length_function=len,   # switch to tiktoken for exact token counts
)

# Required metadata fields:
{
  "source": "docs/install-guide.md",    # document path
  "section": "§3.2 Environment Variables",  # heading path (H1 > H2 > H3)
  "chunk_index": 7,                     # chunk order within the document
  "updated_at": "2024-03-01",           # document update time; used for stale filtering
  "doc_version": "v2.4.1"               # document version
}
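
One way to use these fields at query time is to drop stale chunks before ranking. The sketch below is only an illustration: `ChunkMeta`, `filter_stale`, the 365-day cutoff, and the second sample record are hypothetical, not part of any library or of this page's pipeline.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical container mirroring the metadata fields listed above."""
    source: str
    section: str
    chunk_index: int
    updated_at: str   # ISO date string, e.g. "2024-03-01"
    doc_version: str


def filter_stale(chunks: list[ChunkMeta], today: date, max_age_days: int = 365) -> list[ChunkMeta]:
    """Keep only chunks whose document was updated within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    return [c for c in chunks if date.fromisoformat(c.updated_at) >= cutoff]


# Illustrative records; the second one is invented to show a stale document.
chunks = [
    ChunkMeta("docs/install-guide.md", "§3.2 Environment Variables", 7, "2024-03-01", "v2.4.1"),
    ChunkMeta("docs/legacy-setup.md", "§1 Overview", 0, "2021-06-15", "v1.0.0"),
]
fresh = filter_stale(chunks, today=date(2024, 6, 1))
```

Passing `today` explicitly keeps the filter deterministic, which makes it easier to replay in offline evaluation.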

Overlap calculation: with chunk_size=800, an overlap of 80 characters (10%) is a practical lower bound. For prose-heavy CJK documents, increase it to 15% (120 characters), since each sentence carries higher information density and a sentence cut at a boundary loses more context.
Tables/code blocks: keep each one as a single chunk with no internal splitting. If one exceeds the size limit, truncate it and prepend the comment # (truncated, full at {uri}) so readers can find the full version.
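
The rule for code blocks could be sketched as a pre-pass that pulls fenced blocks out whole before the prose goes to the regular splitter. This is a minimal sketch assuming Markdown-style triple-backtick fences; `split_atomic` and the 2000-character default are hypothetical, and tables would need a similar detector of their own.

```python
import re

FENCE = "`" * 3  # triple backtick, built programmatically so this example nests cleanly
FENCE_RE = re.compile(f"({FENCE}.*?{FENCE})", re.DOTALL)


def split_atomic(text: str, uri: str, max_chars: int = 2000) -> list[str]:
    """Split text into prose segments and whole fenced code blocks.

    Code blocks are never split internally. An overlong block is truncated
    to max_chars and prefixed with a notice pointing at the full source.
    """
    chunks = []
    # re.split with a capturing group keeps the matched code blocks in the output.
    for part in FENCE_RE.split(text):
        part = part.strip()
        if not part:
            continue
        if part.startswith(FENCE) and len(part) > max_chars:
            notice = f"# (truncated, full at {uri})\n"
            part = notice + part[: max_chars - len(notice)]
        chunks.append(part)
    return chunks


sample = f"Intro text.\n\n{FENCE}bash\nexport A=1\n{FENCE}\n\nOutro."
parts = split_atomic(sample, uri="docs/install-guide.md")
```

The prose segments returned here would still go through the normal splitter; only the fenced blocks are treated as atomic.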

Retrieval and reranking

Pure vector search can miss exact terms and proper nouns; sparse keyword search can be noisy. Common pattern: take candidates from both paths, fuse scores, then rerank top-N with a small cross-encoder and pass only top-k to generation.
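
For the sparse path, a minimal in-memory Okapi BM25 scorer might look like the sketch below. It assumes a non-empty corpus, a naive whitespace tokenizer, and the Lucene-style IDF variant; the `BM25` class, the `tokenize` helper, and the sample documents are illustrative rather than taken from any particular library.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive lowercase whitespace tokenizer; a real system would use a proper analyzer."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over an in-memory, non-empty corpus (not optimized)."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avgdl = sum(len(toks) for toks in self.doc_tokens) / len(docs)
        n_docs = len(docs)
        df = Counter(term for toks in self.doc_tokens for term in set(toks))
        # Lucene-style IDF, which stays non-negative for very common terms.
        self.idf = {t: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0) for t, n in df.items()}

    def score(self, query: str, idx: int) -> float:
        """BM25 score of document idx for the query."""
        freqs, dl = self.doc_freqs[idx], len(self.doc_tokens[idx])
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_n(self, query: str, n: int = 10) -> list[int]:
        """Document indices by descending score; documents scoring zero are dropped."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        ranked = sorted(scored, key=lambda p: (-p[0], p[1]))
        return [i for s, i in ranked if s > 0][:n]


docs = [
    "set the DATABASE_URL environment variable",
    "install the package with pip",
    "rotate API keys every ninety days",
]
bm25 = BM25(docs)
hits = bm25.top_n("DATABASE_URL variable")
```

Exact identifiers such as `DATABASE_URL` are the kind of term a dense embedding can blur, which is why the sparse path is kept alongside the vector path.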

Query rewrite: multi-query expansion or HyDE when appropriate; document latency, cost, and drift in the SKILL.
Fusion: RRF, weighted linear, or learned rank; log per-path contribution for tuning.
Reranking: pin the cross-encoder batch size and truncation length so online behavior matches offline evaluation.
Empty results: define fallback (broaden recall, switch index, ask the user to narrow) and monitor "zero-hit rate".
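
Of the fusion options listed, Reciprocal Rank Fusion (RRF) is the simplest to sketch because it needs only ranks, not comparable scores. The function below is a minimal illustration; `rrf_fuse` is a hypothetical name, and k=60 is the commonly used default constant rather than a tuned value. It also returns the per-path contribution so that it can be logged.

```python
def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float, dict[str, float]]]:
    """Fuse named rankings with Reciprocal Rank Fusion.

    rankings maps a path name (e.g. "bm25", "vector") to chunk ids in rank order.
    Returns (chunk_id, fused_score, per_path_contribution), best first.
    """
    scores: dict[str, float] = {}
    contrib: dict[str, dict[str, float]] = {}
    for path, ranked_ids in rankings.items():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            part = 1.0 / (k + rank)
            scores[chunk_id] = scores.get(chunk_id, 0.0) + part
            contrib.setdefault(chunk_id, {})[path] = part
    # Sort by fused score, breaking ties by chunk id for stable output.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return [(c, scores[c], contrib[c]) for c in ordered]


fused = rrf_fuse({"bm25": ["a", "b", "c"], "vector": ["b", "d", "a"]})
```

A chunk that appears in both lists outranks one that tops only a single list, which is the behavior hybrid retrieval is after.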

Debug order: first check whether recall contains the correct chunk; then whether reranking promotes it into top-k; only then tune the prompt.
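
The debug order can be expressed as a small triage helper that is run per failing query. This is only a sketch; `diagnose` and its return labels are hypothetical names.

```python
def diagnose(gold_id: str, recalled: list[str], reranked: list[str], k: int) -> str:
    """Locate the first pipeline stage that lost the correct chunk.

    recalled is the hybrid top-N candidate list; reranked is the reranker output.
    """
    if gold_id not in recalled:
        return "recall_miss"       # fix chunking, indexing, or query rewrite first
    if gold_id not in reranked[:k]:
        return "rerank_miss"       # recall is fine; the reranker demoted the chunk
    return "check_generation"      # the chunk reached the prompt; tune generation
```

Aggregating these labels over an evaluation set shows which stage deserves attention before any prompt changes are made.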

Citations and generation constraints

Instruct the model to answer only from retrieved chunks; bind each claim to a citation id aligned with vector metadata. When evidence is insufficient, emit structured refusal instead of fabrication.

Citation format: e.g. [doc_id § section] or [^chunk_uuid], matching clickable anchors in the UI.
Conflicting chunks: if retrieved results contradict each other, require the model to list contradictions and lower certainty.
SKILL checklist: index refresh cadence, permission filters, PII redaction rules, and whether the license allows summaries to leave the organization.
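
Citation binding can be checked mechanically after generation. The sketch below assumes the [^chunk_uuid] citation style and simple alphanumeric ids; `build_context`, `check_citations`, and the sample chunks are hypothetical.

```python
import re

# Matches citation markers of the form [^chunk_id].
CITE_RE = re.compile(r"\[\^([A-Za-z0-9_-]+)\]")


def build_context(chunks: dict[str, str]) -> str:
    """Render retrieved chunks with their ids so the model can cite them as [^id]."""
    return "\n\n".join(f"[^{cid}]\n{text}" for cid, text in chunks.items())


def check_citations(answer: str, chunks: dict[str, str]) -> dict:
    """Verify that the answer cites at least one chunk and only retrieved chunks."""
    cited = list(dict.fromkeys(CITE_RE.findall(answer)))  # de-duplicate, keep order
    unknown = [cid for cid in cited if cid not in chunks]
    return {"ok": bool(cited) and not unknown, "cited": cited, "unknown": unknown}


retrieved = {
    "c1": "Set DATABASE_URL before starting the service.",
    "c2": "The service listens on port 8080 by default.",
}
context = build_context(retrieved)
report = check_citations("Set DATABASE_URL first [^c1].", retrieved)
```

An answer that fails this check (no citations, or ids outside the retrieved set) is a candidate for replacement with the structured refusal rather than being shown to the user.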

Chunk character estimator

Front-end only: the estimator uses common heuristics to convert a target token count per chunk into a rough character cap, then runs a naive fixed-character sliding window over sample text to count chunks. It uses no real tokenizer, so treat the output as an initial configuration only.

The token-to-character ratios used are heuristics: roughly 1∶3.5–4 for English and 1∶1.2–1.8 for CJK-heavy text. Trust your production tokenizer for final values.
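
The same estimate can be reproduced offline. This sketch applies the heuristic ratios quoted above and a naive fixed-character sliding window; `char_cap` and `count_chunks` are hypothetical helpers, and no real tokenizer is involved.

```python
import math

# Heuristic characters-per-token ranges quoted above; not a tokenizer.
CHARS_PER_TOKEN = {"english": (3.5, 4.0), "cjk": (1.2, 1.8)}


def char_cap(target_tokens: int, script: str = "english") -> tuple[int, int]:
    """Rough (low, high) character cap for a target token count per chunk."""
    lo, hi = CHARS_PER_TOKEN[script]
    return int(target_tokens * lo), int(target_tokens * hi)


def count_chunks(text_len: int, window: int, overlap: int) -> int:
    """Number of chunks a fixed-character sliding window produces for text_len characters."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if text_len == 0:
        return 0
    if text_len <= window:
        return 1
    step = window - overlap  # each new chunk advances by this many characters
    return 1 + math.ceil((text_len - window) / step)
```

For example, a 2000-character sample with an 800-character window and 80 characters of overlap yields windows starting at 0, 720, and 1440, so three chunks.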

Evaluation and operations

Offline: hit rate, MRR, nDCG, human relevance spot checks; online: zero-hit rate, refusal rate, citation click-through, user corrections. Guard against silent index drift (e.g. embedding model swap without full re-embed).
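
The offline metrics can be computed with a few lines each. The definitions below are the standard ones (hit rate at k, mean reciprocal rank, nDCG at k with a log2 discount); the function names and the toy data are illustrative.

```python
import math


def hit_rate(results: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for ranked, rel in zip(results, gold) if rel & set(ranked[:k]))
    return hits / len(results)


def mrr(results: list[list[str]], gold: list[set[str]]) -> float:
    """Mean of 1/rank of the first relevant chunk per query (0 when none is found)."""
    total = 0.0
    for ranked, rel in zip(results, gold):
        for rank, cid in enumerate(ranked, start=1):
            if cid in rel:
                total += 1.0 / rank
                break
    return total / len(results)


def ndcg_at_k(ranked: list[str], relevance: dict[str, float], k: int) -> float:
    """nDCG@k for one query; relevance maps chunk id to a graded gain."""
    dcg = sum(relevance.get(cid, 0.0) / math.log2(i + 1) for i, cid in enumerate(ranked[:k], start=1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


results = [["a", "b", "c"], ["x", "y", "z"]]
gold = [{"b"}, {"q"}]
```

With this toy data the first query finds its relevant chunk at rank 2 and the second finds nothing, so hit rate at 3 is 0.5 and MRR is 0.25.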

Query rewrite: log recall delta before/after rewrite to catch "rewrite amplifies drift".
Versioning: record index schema, embedding model, and reranker versions in release notes and rollback plans.
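
Both operational checks can be automated. The sketch below is one possible shape, not a prescribed design: `recall_delta`, `IndexManifest`, `drift`, and the version strings are all hypothetical.

```python
from dataclasses import dataclass, fields


def recall(gold: set[str], retrieved: list[str]) -> float:
    """Share of gold chunks present in the retrieved list."""
    return len(gold & set(retrieved)) / len(gold) if gold else 0.0


def recall_delta(gold: set[str], before: list[str], after: list[str]) -> float:
    """Recall after query rewrite minus recall before; negative means the rewrite hurt."""
    return recall(gold, after) - recall(gold, before)


@dataclass(frozen=True)
class IndexManifest:
    """Versions recorded when the index was built, or configured in the running service."""
    schema_version: str
    embedding_model: str
    reranker: str


def drift(index: IndexManifest, runtime: IndexManifest) -> list[str]:
    """Names of fields that differ between the built index and the running service."""
    return [f.name for f in fields(IndexManifest) if getattr(index, f.name) != getattr(runtime, f.name)]


built = IndexManifest("3", "embed-model-v1", "rerank-model-v2")      # placeholder versions
serving = IndexManifest("3", "embed-model-v2", "rerank-model-v2")    # model swapped, index not rebuilt
```

A non-empty `drift(built, serving)` at startup is a signal to block the release or trigger a full re-embed, which is the silent-drift case described above.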

---
name: rag-retrieval-pipeline
description: Design or review RAG retrieval and citation flow
---
# Steps
1. Chunking, metadata, index updates
2. Retrieval: vector / keyword / rerank
3. Generation: citation format, refusal, evaluation
