RAG retrieval

RAG quality depends on three factors: chunking granularity, retrieval strategy, and citation binding. This page provides a comparison of three chunking strategies, implementation code for BM25 + vector hybrid retrieval, citation format conventions, and an analysis of the common failure mode "retrieved but not used".

Each step in the RAG pipeline "chunk → embed → retrieve → rerank → assemble → generate" has its own independent failure modes. When debugging: first verify whether recall contains the correct answer chunk; then check whether reranking promotes it into top-k; only then tune the generation prompt.

RAG pipeline (conceptual)

The ASCII below is conceptual ordering; implementations may insert "query rewrite" or "permission filter" at different points, but metadata and citation ids must stay consistent end to end.

  [ Raw docs / API / crawl ]
              │
              ▼
        [ Clean: dedupe / version / license tags ]
              │
              ▼
        [ Chunk + metadata: title path / URI / paragraph anchors ]
              │
              ▼
        [ Embed into vector index ] ◄──── [ Sparse index: BM25, etc. ]
              │
              ▼
  ┌─────────────────────────────────────┐
  │  Query side: rewrite (opt) → hybrid top-N   │
  │           → cross-encoder rerank top-k     │
  │           → assemble in budget + cite table │
  └─────────────────────────────────────┘
              │
              ▼
        [ LLM: answer from chunks only / refuse / emit citations ]
  • Tables, code blocks, and lists often deserve separate chunking rules to avoid semantic breaks when mixed with prose.
  • Embedding model language should match the corpus; for cross-lingual cases label explicitly or split indexes.

Chunking strategy comparison — when to use each

Three mainstream chunking strategies each have their appropriate use cases; parameter choice directly affects retrieval hit rate:

┌──────────────────┬────────────────┬──────────────────────────┬────────────────────────────┐
│ Strategy          │ chunk_size     │ Suitable for             │ Main risk                  │
├──────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Fixed-size        │ 512-1024 tokens│ Plain text, logs,        │ Truncates sentences/paras; │
│                   │ overlap: 10%   │ unstructured docs        │ semantic breaks            │
├──────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Semantic          │ Adaptive       │ Articles, papers,        │ Uneven chunk sizes; large  │
│ (NLTK/spaCy)      │ avg 300-600 tok│ paragraph-structured docs│ chunks consume more tokens │
├──────────────────┼────────────────┼──────────────────────────┼────────────────────────────┤
│ Recursive         │ Split on \n\n  │ Markdown/HTML/codebases, │ Requires maintaining a     │
│ (LangChain        │ then \n, space │ heading-structured docs  │ separator priority list    │
│  RecursiveTC)     │ max 1024 tokens│                          │                            │
└──────────────────┴────────────────┴──────────────────────────┴────────────────────────────┘

# Recommended parameters (Python / LangChain):
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # characters; ~500-800 for CJK docs, ~800-1200 for English
    chunk_overlap=80,      # overlap = chunk_size × 10%; reduces boundary truncation
    separators=["\n\n", "\n", ".", " ", ""],  # priority from high to low
    length_function=len,   # switch to tiktoken for exact token counts
)

# Required metadata fields:
{
  "source": "docs/install-guide.md",    # document path
  "section": "§3.2 Environment Variables",  # heading path (H1 > H2 > H3)
  "chunk_index": 7,                     # chunk order within the document
  "updated_at": "2024-03-01",           # document update time; used for stale filtering
  "doc_version": "v2.4.1"               # document version
}
  • Overlap calculation: with chunk_size=800, overlap=80 (10%) is the empirical minimum. Increase to 15% (120 chars) for prose-heavy CJK documents since each sentence carries higher information density.
  • Tables/code blocks: treat as a single chunk with no internal splitting; for overlong ones, prepend a comment # (truncated, full at {uri}) before truncating.

Retrieval and reranking

Pure vector search can miss exact terms and proper nouns; sparse keyword search can be noisy. Common pattern: take candidates from both paths, fuse scores, then rerank top-N with a small cross-encoder and pass only top-k to generation.

  • Query rewrite: multi-query expansion or HyDE when appropriate; document latency, cost, and drift in the SKILL.
  • Fusion: RRF, weighted linear, or learned rank; log per-path contribution for tuning.
  • Reranking: fix cross-encoder batch and truncation length so online matches offline eval.
  • Empty results: define fallback (broaden recall, switch index, ask the user to narrow) and monitor "zero-hit rate".
Debug order: first check whether recall contains the correct chunk; then whether reranking promotes it into top-k; only then tune the prompt.

Citations and generation constraints

Instruct the model to answer only from retrieved chunks; bind each claim to a citation id aligned with vector metadata. When evidence is insufficient, emit structured refusal instead of fabrication.

  • Citation format: e.g. [doc_id § section] or [^chunk_uuid], matching clickable anchors in the UI.
  • Conflicting chunks: if retrieved results contradict each other, require the model to list contradictions and lower certainty.
  • SKILL checklist: index refresh cadence, permission filters, PII redaction rules, whether summaries may leave the org under license.

Chunk character estimator

Front-end only: use common heuristics to convert target tokens per chunk into rough character caps, and naive fixed-character sliding windows to count chunks on sample text (no real tokenizer—initial config only).

Chunks: — (enter text first)

Token→char shown for English ~1∶3.5–4 and CJK-heavy ~1∶1.2–1.8; trust your production tokenizer.

Evaluation and operations

Offline: hit rate, MRR, nDCG, human relevance spot checks; online: zero-hit rate, refusal rate, citation click-through, user corrections. Guard against silent index drift (e.g. embedding model swap without full re-embed).

  • Query rewrite: log recall delta before/after rewrite to catch "rewrite amplifies drift".
  • Versioning: index schema, embedding model, reranker versions in release notes and rollback plans.
---
name: rag-retrieval-pipeline
description: Design or review RAG retrieval and citation flow
---
# Steps
1. Chunking, metadata, index updates
2. Retrieval: vector / keyword / rerank
3. Generation: citation format, refusal, evaluation

Back to skills More skills