Try finding one recipe step inside a 400-page cookbook without a table of contents. That is what a retrieval system faces when you dump entire PDFs into search unchanged. Chunking cuts documents into bite-sized pieces the system can retrieve precisely.
What Is Document Chunking?
Chunking means splitting long text into smaller segments before indexing. Each chunk becomes one row in your search index — like flashcards, each holding one idea.
Why Do We Need Chunking?
Search returns whole chunks to the LLM. If chunks are huge, you waste context window space on irrelevant paragraphs. If chunks are tiny, sentences lose the surrounding context that gives them meaning. Goldilocks sizing matters.
How Chunking Fits in RAG
PDF / Wiki / Email
↓ split
Chunks + metadata (source, page)
↓ embed / index
Searchable knowledge base
Step-by-Step: Chunk a Policy PDF
Step 1: Extract text from PDF (preserve headings).
Step 2: Split on H2/H3 headings first.
Step 3: If a section still exceeds 800 tokens, split by paragraph with 50-token overlap.
Step 4: Attach metadata: documentId, sectionTitle, pageNumber.
Step 5: Sample C# pseudo-logic:
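// Document, DocumentChunk, and SplitWithOverlap are assumed to be defined elsewhere.
IEnumerable<DocumentChunk> ChunkDocument(Document document)
{
    foreach (var section in document.Sections)
    {
        // Same limits as Step 3: at most 800 tokens per chunk, 50 tokens of overlap.
        var chunks = SplitWithOverlap(section.Text, maxTokens: 800, overlap: 50);
        foreach (var chunk in chunks)
            yield return new DocumentChunk(section.Title, chunk);
    }
}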
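The pseudo-logic above calls SplitWithOverlap without defining it. One possible shape for that helper is sketched below. It is an illustration, not a reference implementation: the Chunker class is a hypothetical name, whitespace-separated words stand in for real tokenizer tokens, and the split is a plain sliding window rather than the paragraph-aware split described in Step 3.

```csharp
using System;
using System.Collections.Generic;

public static class Chunker
{
    // Hypothetical helper: returns windows of at most maxTokens tokens,
    // where each window repeats the last `overlap` tokens of the previous one.
    // Assumption: whitespace-separated words approximate tokenizer tokens.
    public static List<string> SplitWithOverlap(string text, int maxTokens, int overlap)
    {
        if (maxTokens <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxTokens));
        if (overlap < 0 || overlap >= maxTokens)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var tokens = text.Split(new[] { ' ', '\t', '\r', '\n' },
                                StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        int step = maxTokens - overlap; // how far the window advances each time

        for (int start = 0; start < tokens.Length; start += step)
        {
            int length = Math.Min(maxTokens, tokens.Length - start);
            chunks.Add(string.Join(" ", tokens, start, length));
            if (start + length >= tokens.Length)
                break; // this window already reached the end of the text
        }
        return chunks;
    }
}
```

For a ten-word input with maxTokens: 4 and overlap: 1, this sketch returns three chunks, each sharing one word with its neighbor. A production version would count tokens with the same tokenizer the embedding model uses.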
Real-World Example
A hardware manual describes fifty products. Chunking by product section means a question about "Model X battery life" retrieves only the Model X section, not the entire manual, so Model Y specs never reach the model and confuse it.
Common Misconceptions
"One chunk size fits all." Legal contracts and chat logs need different strategies.
"Overlap is wasted storage." Overlap prevents answers from missing sentences split across chunk borders.
Chunking Strategies Compared
- Fixed-size — split every N tokens; simple but may cut sentences awkwardly.
- Structure-aware — split on headings and paragraphs; best for manuals and wikis.
- Semantic — use models to detect topic boundaries; advanced but powerful for mixed content.
Begin with structure-aware chunking on clean Markdown or HTML exports before chasing fancy semantic splitters.
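As a starting point for the structure-aware strategy, the sketch below splits a Markdown export at H2 and H3 headings. The Section record and MarkdownSplitter class are hypothetical names introduced for illustration. The code assumes ATX-style headings ("## " and "### ") and does not yet ignore heading-like lines inside code blocks.

```csharp
using System;
using System.Collections.Generic;

public record Section(string Title, string Text);

public static class MarkdownSplitter
{
    // Splits Markdown into sections at H2 ("## ") and H3 ("### ") headings.
    // Text that appears before the first heading is kept under an empty title.
    public static List<Section> SplitOnHeadings(string markdown)
    {
        var sections = new List<Section>();
        var title = "";
        var body = new List<string>();

        // Closes the section collected so far and resets the line buffer.
        void Flush()
        {
            var text = string.Join("\n", body).Trim();
            if (title.Length > 0 || text.Length > 0)
                sections.Add(new Section(title, text));
            body.Clear();
        }

        foreach (var line in markdown.Replace("\r\n", "\n").Split('\n'))
        {
            bool isHeading = line.StartsWith("## ", StringComparison.Ordinal)
                          || line.StartsWith("### ", StringComparison.Ordinal);
            if (isHeading)
            {
                Flush();
                title = line.TrimStart('#').Trim();
            }
            else
            {
                body.Add(line);
            }
        }
        Flush();
        return sections;
    }
}
```

Sections that come out longer than the size limit can then be passed to an overlap-based splitter, which keeps the heading as shared metadata for every resulting chunk.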
Why Metadata Matters
Store department, productLine, effectiveDate on each chunk. Filters narrow retrieval so HR questions do not pull IT outage posts just because both mention "policy."
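The effect of a metadata filter can be illustrated in memory with LINQ. This is only a sketch: ChunkRecord and ChunkFilter are hypothetical names, and a real search service would apply the same conditions through its own filter syntax rather than over an in-memory list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk shape carrying the metadata fields named above.
public record ChunkRecord(string Text, string Department, string ProductLine, DateTime EffectiveDate);

public static class ChunkFilter
{
    // Keeps only chunks owned by the given department that are already in effect,
    // so ranking never sees chunks from unrelated departments.
    public static List<ChunkRecord> ForDepartment(
        IEnumerable<ChunkRecord> chunks, string department, DateTime asOf) =>
        chunks
            .Where(c => string.Equals(c.Department, department, StringComparison.OrdinalIgnoreCase))
            .Where(c => c.EffectiveDate <= asOf)
            .ToList();
}
```

Applied before ranking, a call such as ChunkFilter.ForDepartment(allChunks, "HR", DateTime.Today) would drop an IT outage post even if its text mentions "policy" several times.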
Scanned PDFs Need OCR
Image-only PDFs contain no selectable text until OCR (Optical Character Recognition) extracts words. Azure AI Search skillsets can OCR during indexing. Chunking empty text produces useless embeddings — always verify extracted text quality on a sample page before indexing thousands of scans.
Keep a spreadsheet during experiments that records chunk size, overlap, and hit rate on ten test questions. Small, disciplined experiments beat random tweaking. Share the results with teammates so that debates rest on numbers rather than gut feeling.
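The hit rate for such an experiment can be computed with a few lines of code. The sketch below assumes each test question is paired with the ID of the chunk that should answer it. TestQuestion, RetrievalEval, and the search delegate are hypothetical stand-ins for whatever retrieval call your system exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record TestQuestion(string Question, string ExpectedChunkId);

public static class RetrievalEval
{
    // Fraction of questions whose expected chunk appears in the top-k results.
    // `search` takes a question and k, and returns ranked chunk IDs.
    public static double HitRate(
        IReadOnlyCollection<TestQuestion> questions,
        Func<string, int, IEnumerable<string>> search,
        int k)
    {
        if (questions.Count == 0) return 0.0;
        int hits = questions.Count(q =>
            search(q.Question, k).Take(k).Contains(q.ExpectedChunkId));
        return (double)hits / questions.Count;
    }
}
```

Running the same question set after each change to chunk size or overlap gives one comparable number per spreadsheet row.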
Handling Code and Tables
Technical docs mix prose, code blocks, and tables. Keep each code block intact in a single chunk when possible, because splitting mid-function confuses both search and models. Record the programming language as a metadata tag on each chunk so developers can filter API reference material separately from marketing PDFs.
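One way to keep code blocks whole is to split on blank lines everywhere except inside fenced code. The sketch below assumes Markdown input with triple-backtick fences. BlockSplitter is a hypothetical name, and the resulting blocks would still need to be packed into chunks under your token limit.

```csharp
using System;
using System.Collections.Generic;

public static class BlockSplitter
{
    // Three backticks, the Markdown code-fence marker.
    private static readonly string Fence = new string('`', 3);

    // Splits Markdown into blocks at blank lines, except inside fenced code,
    // so every fenced code block ends up whole inside a single block.
    public static List<string> SplitKeepingCode(string markdown)
    {
        var blocks = new List<string>();
        var current = new List<string>();
        bool inFence = false;

        foreach (var line in markdown.Replace("\r\n", "\n").Split('\n'))
        {
            // A fence line toggles between "inside code" and "outside code".
            if (line.TrimStart().StartsWith(Fence, StringComparison.Ordinal))
                inFence = !inFence;

            if (!inFence && string.IsNullOrWhiteSpace(line))
            {
                if (current.Count > 0) blocks.Add(string.Join("\n", current));
                current.Clear();
            }
            else
            {
                current.Add(line);
            }
        }
        if (current.Count > 0) blocks.Add(string.Join("\n", current));
        return blocks;
    }
}
```

Blank lines inside a function body no longer trigger a split, so a sample with several statements separated by empty lines stays in one block.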
Tables should include header row text in every chunk derived from the table — otherwise a chunk containing only numeric rows loses column meaning. Prepending 'Table: Server SKUs — columns: Name, RAM, Price' saves retrieval quality on specification sheets.
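A minimal sketch of this idea follows, assuming the table has already been parsed into column names and rows. TableChunker is a hypothetical name, and the header line is modeled on the example above with slightly different punctuation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableChunker
{
    // Groups table rows into chunks and repeats the caption and column names
    // at the top of every chunk, so numeric rows keep their meaning.
    public static List<string> ChunkTable(
        string caption,
        IReadOnlyList<string> columns,
        IReadOnlyList<string[]> rows,
        int rowsPerChunk)
    {
        if (rowsPerChunk <= 0)
            throw new ArgumentOutOfRangeException(nameof(rowsPerChunk));

        var header = $"Table: {caption}; columns: {string.Join(", ", columns)}";
        var chunks = new List<string>();

        for (int i = 0; i < rows.Count; i += rowsPerChunk)
        {
            var lines = rows.Skip(i).Take(rowsPerChunk)
                            .Select(r => string.Join(" | ", r));
            chunks.Add(header + "\n" + string.Join("\n", lines));
        }
        return chunks;
    }
}
```

With three rows and rowsPerChunk set to 2, the sketch produces two chunks, and the second one still begins with the caption and column names even though it holds only the last row.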
Summary
Chunking is invisible to users but makes or breaks RAG. Split thoughtfully, store metadata, and test retrieval with real questions from Lesson 9.
Key Takeaways
- Chunking splits large documents into searchable pieces.
- Size and overlap strongly affect retrieval quality.
- Split on headings and paragraphs when possible.
- Metadata (title, page, section) helps filtering and citations.
- Test chunk settings with real user questions, not guesses.