Key Takeaways
- RAG pipeline chunking strategies determine retrieval quality more than the embedding model or vector store — most recall failures trace back to how documents were split during ingestion
- Fixed-size chunking (256–512 tokens with 10–15% overlap) is the right starting point for homogeneous prose; semantic and structural strategies outperform it on technical docs and mixed-format corpora
- Hierarchical (parent-child) chunking is often the strongest approach for production systems: small chunks for precise vector retrieval, large parent chunks for full context delivery to the LLM
- Always evaluate chunking changes against a golden retrieval set (30–50 annotated queries) before shipping — target recall@3 above 80% before adjusting the embedding model or prompt
The fastest way to diagnose a RAG pipeline returning wrong answers is not to inspect the prompt or swap the LLM — it is to look at what your vector store is actually retrieving. In most production failures we diagnose, the correct information exists in the corpus. It was just chunked in a way that makes it unretrievable.
RAG pipeline chunking strategies determine whether your vector store finds the right context or retrieves noise. The four production-relevant approaches — fixed-size, semantic, structural, and hierarchical — each trade document coverage against retrieval precision differently. Your corpus type, query profile, and LLM context budget determine which one fits.
Why Chunking Is the Primary Source of RAG Retrieval Failures
Most RAG pipeline failures in production trace back to chunking decisions made during ingestion — not to the LLM, not to the embedding model, and not to the vector store. When retrieval recall drops after launch, the right information usually exists in the corpus but was split across chunk boundaries or embedded with context that dilutes its semantic signal.
The mechanism is straightforward. Embedding models map text to a fixed-size vector that encodes semantic meaning. When a chunk contains a complete, coherent thought — a sentence, a paragraph, a documentation section — the resulting vector is a clean representation of that idea. When a chunk cuts mid-sentence, mixes unrelated topics, or spans three distinct concepts, the vector averages across those signals and becomes a poor match for any specific query.
Three chunking failure modes appear most often in production:
Boundary truncation. A sentence containing the answer to a query is split across two chunks. Neither fragment scores highly enough to be retrieved on its own; together they would answer the query, but the retriever never sees them together.
Context dilution. A 1,024-token fixed chunk contains one highly relevant paragraph and five unrelated ones. The relevant passage's signal is averaged into the surrounding noise, and the cosine similarity score drops below the retrieval threshold.
Missing metadata. Chunks that are otherwise well-sized carry no metadata about their source section, document type, or date. Metadata-filtered retrieval — essential for multi-tenant or time-sensitive corpora — cannot work without it.
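Context dilution is easy to see in a toy setting. The sketch below is only an illustration: it uses word counts as a stand-in for a real embedding model, and the query and chunk texts are invented for the example. It scores the same relevant passage twice, once on its own and once buried in unrelated text.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example texts (assumptions for this sketch)
query = "reset api key"
relevant = "to reset your api key open settings and click rotate key"
unrelated = (
    "billing runs monthly invoices are emailed as pdf "
    "teams can add seats from the admin console at any time"
)

focused_chunk = relevant                    # one coherent idea
diluted_chunk = relevant + " " + unrelated  # same idea buried in noise

focused_score = cosine(bow_vector(query), bow_vector(focused_chunk))
diluted_score = cosine(bow_vector(query), bow_vector(diluted_chunk))

print(f"focused: {focused_score:.2f}  diluted: {diluted_score:.2f}")
```

The relevant words are identical in both chunks; only the amount of surrounding text changes, and the diluted chunk scores lower. Real embedding models are far more sophisticated, but the direction of the effect is the same.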
Fixed-Size Chunking: The Right Starting Point
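Fixed-size chunking splits documents by token count (typically 256 to 1,024 tokens with configurable overlap) with little or no regard for sentence or paragraph boundaries. It is the default in most RAG frameworks, fast to implement with tiktoken or LangChain's RecursiveCharacterTextSplitter (which prefers paragraph and line breaks when they fit within the size limit), and predictable in its output distribution. Note that the splitter's default length function counts characters, so the example uses from_tiktoken_encoder to measure sizes in tokens.

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # from_tiktoken_encoder measures chunk_size and chunk_overlap in tokens;
    # the plain constructor with length_function=len would count characters.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=512,
        chunk_overlap=64,  # 12.5% overlap
    )
    chunks = splitter.split_text(document)  # document: a plain string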
When it works. Fixed-size chunking performs well on homogeneous prose — financial reports, research papers, long-form articles — where paragraphs flow continuously and natural section breaks are sparse. It also works when your query profile is general rather than fact-specific.
When it fails. Fixed-size chunking degrades on structured documents — technical documentation, code-heavy wikis, PDFs with tables — where arbitrary token splits regularly land mid-sentence or mid-table. It also underperforms when queries are highly specific: a 512-token chunk that contains the one sentence you need plus 490 tokens of unrelated context will rank below a well-targeted 128-token semantic chunk.
Chunk size guidance:
| Chunk size | Best for | Trade-off |
|---|---|---|
| 128–256 tokens | Fact-lookup queries, dense technical docs | More chunks, higher index cost |
| 256–512 tokens | General-purpose starting point | Balanced precision and context |
| 512–1,024 tokens | Long-form analytical questions | Risk of context dilution |
Set overlap to 10–15% of chunk size. Below 10%, boundary truncation increases; above 20%, index inflation outweighs the recall benefit.
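To make the overlap arithmetic concrete, here is a minimal sliding-window sketch. The function name chunk_with_overlap and the placeholder tokens are assumptions for illustration; in practice the token list would come from a real tokenizer.

```python
def chunk_with_overlap(
    tokens: list[str], size: int, overlap_ratio: float = 0.125
) -> list[list[str]]:
    """Split a token list into fixed-size windows that share some tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    overlap = int(size * overlap_ratio)  # e.g. 512 * 0.125 = 64 tokens
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Placeholder tokens stand in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(20)]
chunks = chunk_with_overlap(tokens, size=8, overlap_ratio=0.25)

for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} tokens)")
```

Each window starts step tokens after the previous one, so the last overlap tokens of one chunk reappear at the start of the next. A larger ratio shrinks the step and grows the number of chunks, which is where index inflation comes from.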
Semantic and Structural Chunking: Respecting Document Boundaries
Semantic chunking splits on sentence or paragraph boundaries rather than arbitrary token counts, preserving the linguistic units that embedding models were trained to represent. LangChain's SemanticChunker (shipped in the langchain_experimental package) uses embedding distance between consecutive sentences to detect topic shifts; LlamaIndex's SentenceSplitter respects sentence endings with a configurable maximum chunk size.
Sentence-level semantic chunking is most valuable when documents contain short, high-density sentences where every boundary matters — FAQ pages, support knowledge bases, product documentation. The resulting chunks are variable in size but semantically coherent, which tends to produce better cosine similarity matching for short, precise queries.
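The idea of splitting where consecutive sentences drift apart can be sketched in a few lines. This is not LangChain's implementation: toy_embed is a hypothetical word-count stand-in for an embedding model, and the 0.8 threshold is an arbitrary value chosen for the example text.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 1.0 means no shared vocabulary."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not norm_a or not norm_b:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(text: str, threshold: float = 0.8) -> list[str]:
    """Start a new chunk where adjacent sentences are farther apart than threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_distance(toy_embed(prev), toy_embed(cur)) > threshold:
            groups.append([cur])    # topic shift: open a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


text = (
    "Refunds are issued within five days. "
    "Refunds over 100 dollars are issued after manager approval. "
    "The API rate limit is 60 requests per minute. "
    "Exceeding the rate limit returns a 429 error."
)
result = semantic_chunks(text)
for chunk in result:
    print(repr(chunk))
```

On this text the two refund sentences end up in one chunk and the two rate-limit sentences in another, because the only large distance falls between the second and third sentences. With a real embedding model the threshold would need tuning against your own corpus.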
Structural (header-based) chunking splits on document structure — Markdown headers, HTML headings, or PDF section markers — rather than semantic signals. LangChain's MarkdownHeaderTextSplitter splits on #, ##, and ### boundaries and propagates the header hierarchy as chunk metadata:
    from langchain.text_splitter import MarkdownHeaderTextSplitter

    headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
    chunks = splitter.split_text(markdown_doc)  # returns a list of Document objects
    # Each chunk carries metadata, e.g. {"h1": "Installation", "h2": "Prerequisites"}
This metadata is the key advantage: downstream retrieval can filter by section (h2 == "API Reference") before running vector search, dramatically improving precision on structured technical corpora like developer documentation or internal wikis.
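A filter-then-rank step might look like the sketch below. It is framework-free and purely illustrative: the chunk dictionaries, the two-dimensional vectors, and the search helper are all invented for the example, and a real vector store would apply the filter inside its own query API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, top_k=3, **filters):
    """Keep chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]


# Hand-made chunks with toy 2-D vectors (assumptions for this sketch)
chunks = [
    {"text": "GET /users returns a paginated list.",
     "metadata": {"h1": "Docs", "h2": "API Reference"}, "vector": [0.9, 0.1]},
    {"text": "In this tutorial we list users step by step.",
     "metadata": {"h1": "Docs", "h2": "Tutorial"}, "vector": [1.0, 0.0]},
    {"text": "DELETE /users/{id} removes a user.",
     "metadata": {"h1": "Docs", "h2": "API Reference"}, "vector": [0.1, 0.9]},
]
query_vec = [1.0, 0.0]

unfiltered = search(query_vec, chunks, top_k=1)
filtered = search(query_vec, chunks, top_k=1, h2="API Reference")
print("unfiltered:", unfiltered[0]["text"])
print("filtered:  ", filtered[0]["text"])
```

Without the filter, the tutorial chunk wins on raw similarity; with the section filter applied first, the API reference chunk is returned instead.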
When to use structural over semantic: if your documents have consistent heading structure, structural chunking almost always outperforms semantic splitting on precision. Use semantic splitting when documents are heading-free prose — support tickets, email threads, freeform notes.
Hierarchical Chunking: Precision Retrieval with Full Context
Hierarchical chunking stores two representations of every document segment: a small chunk (64–128 tokens) for precise retrieval, and a larger parent chunk (512–1,024 tokens) for full context delivery to the LLM. At query time, the vector store retrieves the small chunk, then the system fetches its parent before passing it to the model.
This solves the core tension in chunking: small chunks produce more precise vector retrieval, but pass too little context to the LLM for it to synthesize a complete answer. Large chunks provide full context, but their diluted embeddings underperform in retrieval. Hierarchical chunking decouples the two concerns.
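The retrieve-small, return-large mechanics can be sketched without any framework. Everything here is a simplifying assumption: chunks are measured in words rather than tokens, word overlap stands in for vector search, and the helper names and sample document are invented for the example.

```python
def split_words(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(document: str, parent_size: int, child_size: int):
    """Return (parents, children): children remember which parent they came from."""
    parents = {}   # parent_id -> parent text (what the LLM will see)
    children = []  # (child text, parent_id) pairs (what gets searched)
    for parent_id, parent in enumerate(split_words(document, parent_size)):
        parents[parent_id] = parent
        for child in split_words(parent, child_size):
            children.append((child, parent_id))
    return parents, children


def retrieve_parent(query: str, parents, children):
    """Find the best child by word overlap, then hand back its parent."""
    query_words = set(query.lower().split())
    best_child, parent_id = max(
        children, key=lambda c: len(query_words & set(c[0].lower().split()))
    )
    return best_child, parents[parent_id]


document = (
    "Install the CLI with pip. Then run the init command to create a config file. "
    "The config file stores your API token and default region. "
    "Rotate the token every ninety days from the security page. "
    "Billing is calculated monthly. Invoices are sent by email."
)
parents, children = build_index(document, parent_size=20, child_size=5)
child, parent = retrieve_parent("rotate token ninety", parents, children)
print("matched child:", child)
print("returned parent:", parent)
```

The five-word child matches the query precisely, but on its own it omits where the rotation happens; the parent that is passed on includes that detail.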
LangChain's ParentDocumentRetriever implements this out of the box:
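    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Sizes here are in characters (the splitter's default length function),
    # not tokens; use from_tiktoken_encoder for token-based sizes.
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=100)

    # vectorstore: an already-initialised LangChain vector store (not shown)
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=InMemoryStore(),
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    retriever.add_documents(docs)  # docs: a list of LangChain Document objects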
The production advantage. In Prodinit's production RAG deployments, switching from flat 512-token fixed chunking to a hierarchical 128-token child / 768-token parent setup consistently improves recall@3 — especially for queries that require synthesizing information across a section rather than retrieving a single sentence. The improvement is most pronounced on long technical documents and multi-paragraph policy content.
Trade-offs. Hierarchical chunking roughly doubles the stored text (parent and child chunks coexist, although only the child chunks are embedded) and adds a fetch step at retrieval time. For corpora under 100K documents, the latency delta is negligible: typically 20–40ms of extra docstore fetch on top of the vector search. For very large corpora, this second fetch can become a bottleneck if the docstore (InMemoryStore, Redis, or PostgreSQL) is not co-located with the retriever.
RAG Pipeline Chunking Strategies: How to Choose
Choosing the right RAG pipeline chunking strategy depends on three variables: your document type and structure, your query profile (short fact lookups vs. long analytical questions), and your LLM's context budget. No single strategy wins across all corpora — the decision is empirical, and measuring retrieval recall on a golden dataset before committing is non-negotiable.
| Corpus type | Recommended strategy | Starting chunk size |
|---|---|---|
| Homogeneous prose (reports, articles) | Fixed-size | 512 tokens, 10% overlap |
| Structured technical docs (Markdown, HTML) | Structural (header-based) | Per section, with 512-token sub-chunks |
| Mixed-format documents | Hierarchical parent-child | 128 child / 768 parent |
| Short-form dense content (FAQs, support) | Semantic (sentence-level) | Variable, max 256 tokens |
| Multi-tenant or time-sensitive corpora | Structural + metadata filters | Per section with timestamp/tenant metadata |
The evaluation loop that matters: before finalising any chunking strategy, build a golden retrieval set of 30–50 representative queries, annotate the correct source passages, and measure recall@3 (does the correct chunk appear in the top 3 results?). A well-configured chunking strategy on your specific corpus should reach recall@3 above 80% before you start tuning the embedding model, adjusting similarity thresholds, or rewriting prompts.
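A recall@k check needs very little code. The sketch below assumes a golden set that maps each query to the IDs of its correct chunks and a retrieve callable that returns ranked chunk IDs; the queries, the IDs, and the canned results are all made up so the example runs on its own.

```python
def recall_at_k(golden: dict[str, set[str]], retrieve, k: int = 3) -> float:
    """Fraction of queries with at least one correct chunk in the top k results."""
    if not golden:
        return 0.0
    hits = 0
    for query, relevant_ids in golden.items():
        top_k = retrieve(query)[:k]
        if any(chunk_id in relevant_ids for chunk_id in top_k):
            hits += 1
    return hits / len(golden)


# Golden set: query -> IDs of the annotated correct chunks (invented data)
golden = {
    "how do I rotate an API key": {"c12"},
    "what is the rate limit": {"c40", "c41"},
    "refund window": {"c07"},
    "supported regions": {"c55"},
}

# Canned ranked results standing in for a real retriever (invented data)
fake_results = {
    "how do I rotate an API key": ["c12", "c03", "c09"],
    "what is the rate limit": ["c02", "c41", "c40"],
    "refund window": ["c01", "c02", "c03", "c07"],  # correct chunk at rank 4
    "supported regions": ["c55"],
}

score = recall_at_k(golden, lambda q: fake_results[q], k=3)
print(f"recall@3 = {score:.0%}")
if score < 0.80:
    print("Below target: revisit chunk size, overlap, or strategy first.")
```

Swapping the lambda for a call into your actual retriever turns this into a regression check you can run after every chunking change.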
Chunking decisions made during ingestion are the hardest to change in production, because they require re-embedding and re-indexing the entire corpus. Getting them right before launch is significantly cheaper than fixing retrieval quality drift six weeks in. "The five failure modes in production RAG systems" covers what breaks next after chunking, including stale embeddings and query-document mismatch.
Frequently Asked Questions
What is the best chunk size for a RAG pipeline?
For most RAG use cases, 256–512 tokens per chunk is the practical starting point. Smaller chunks (128–256 tokens) improve precision for fact-lookup queries but risk losing surrounding context; larger chunks (512–1,024 tokens) preserve context but dilute the embedding signal. Test against your query distribution on a golden retrieval set before fixing the chunk size in production.
How does chunk overlap work and how much should I use?
Chunk overlap copies a token slice from the end of one chunk to the start of the next, ensuring sentences spanning a boundary appear in at least one retrievable unit. A 10–15% overlap (roughly 50–75 tokens on a 512-token chunk) is the standard starting point. Too much overlap inflates your vector store without proportional recall gains.
What chunking strategy works best for Markdown and technical documentation?
Header-based structural chunking is the strongest default for Markdown and technical docs. LangChain's MarkdownHeaderTextSplitter splits on #, ##, and ### boundaries and propagates the header hierarchy as chunk metadata, enabling metadata-filtered retrieval by section. Pair it with 512-token sub-chunking inside each header section to prevent oversized chunks from diluting embedding precision.
Does chunk size affect embedding model performance?
Yes — embedding models have an effective input range within which they produce the most meaningful vectors. OpenAI's text-embedding-3-small and text-embedding-3-large accept up to 8,191 tokens, but retrieval precision typically peaks at 256–512 token inputs. Very long chunks force the model to average semantic signal across too much text, reducing the distinctiveness of the resulting vector and lowering cosine similarity scores at retrieval time.
How do you evaluate whether a chunking strategy is working?
Build a golden retrieval set: 30–50 representative queries with annotated correct source passages. For each query, measure whether the correct passage appears in the top-k retrieved chunks (recall@k). A well-chunked corpus should achieve recall@3 above 80% for your query distribution. If it does not, adjust chunk size, overlap, or strategy before touching the embedding model or prompt.













