Lesson 3 — Beginner

Document Chunking for Beginners

Try finding one recipe step inside a 400-page cookbook without a table of contents. That is what AI faces if you dump entire PDFs into search unchanged. Chunking cuts documents into bite-sized pieces the system can retrieve precisely.

What Is Document Chunking?

Chunking means splitting long text into smaller segments before indexing. Each chunk becomes one row in your search index — like flashcards, each holding one idea.

Why Do We Need Chunking?

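Search returns whole chunks to the LLM. If chunks are huge, you waste context window space on irrelevant paragraphs. If chunks are tiny, a retrieved sentence arrives without the surrounding context that gives it meaning. Goldilocks sizing matters: large enough to hold one complete idea, small enough to stay on topic.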

How Chunking Fits in RAG

PDF / Wiki / Email
      ↓ split
Chunks + metadata (source, page)
      ↓ embed / index
Searchable knowledge base

Step-by-Step: Chunk a Policy PDF

Step 1: Extract text from PDF (preserve headings).

Step 2: Split on H2/H3 headings first.

Step 3: If a section still exceeds 800 tokens, split by paragraph with 50-token overlap.

Step 4: Attach metadata: documentId, sectionTitle, pageNumber.

Step 5: Sample C# pseudo-logic:

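// Pseudo-logic: Document, DocumentChunk, and SplitWithOverlap are placeholders.
IEnumerable<DocumentChunk> ChunkDocument(Document document)
{
    foreach (var section in document.Sections)
    {
        // Step 3: sections over 800 tokens are split with a 50-token overlap.
        var chunks = SplitWithOverlap(section.Text, maxTokens: 800, overlap: 50);
        foreach (var chunk in chunks)
            // Step 4: attach metadata to every chunk.
            yield return new DocumentChunk(document.Id, section.Title, section.PageNumber, chunk);
    }
}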
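
The snippet above calls SplitWithOverlap without defining it. One possible shape for that helper is sketched here. It is an assumption, not an existing library call: the signature is a guess at what the snippet intends, and whitespace-separated words stand in for tokens, whereas a real pipeline would count tokens with the embedding model's tokenizer. It also slides a fixed window instead of respecting paragraph boundaries.

```csharp
using System;
using System.Collections.Generic;

// Demo: 10 "tokens", windows of 4, with 1 token shared between neighbors.
foreach (var chunk in Chunker.SplitWithOverlap("a b c d e f g h i j", maxTokens: 4, overlap: 1))
    Console.WriteLine(chunk);
// Prints:
// a b c d
// d e f g
// g h i j

static class Chunker
{
    // ASSUMPTION: whitespace-separated words approximate tokens.
    public static List<string> SplitWithOverlap(string text, int maxTokens, int overlap)
    {
        if (maxTokens <= 0) throw new ArgumentOutOfRangeException(nameof(maxTokens));
        if (overlap < 0 || overlap >= maxTokens) throw new ArgumentOutOfRangeException(nameof(overlap));

        var words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        int step = maxTokens - overlap; // how far the window advances each time

        for (int start = 0; start < words.Length; start += step)
        {
            int count = Math.Min(maxTokens, words.Length - start);
            chunks.Add(string.Join(" ", words, start, count));
            if (start + count >= words.Length) break; // last window reached the end
        }
        return chunks;
    }
}
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a border appears whole in at least one chunk when the overlap is large enough.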

Real-World Example

A hardware manual describes fifty products. Chunking by product section means a question about "Model X battery life" retrieves only the Model X section, instead of the whole manual, where Model Y specs could confuse the model.

Common Misconceptions

"One chunk size fits all." Legal contracts and chat logs need different strategies.

"Overlap is wasted storage." Overlap prevents answers from missing sentences split across chunk borders.

Chunking Strategies Compared

  • Fixed-size — split every N tokens; simple but may cut sentences awkwardly.
  • Structure-aware — split on headings and paragraphs; best for manuals and wikis.
  • Semantic — use models to detect topic boundaries; advanced but powerful for mixed content.

Begin with structure-aware chunking on clean Markdown or HTML exports before chasing fancy semantic splitters.
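
As a rough illustration of structure-aware chunking on a Markdown export, the sketch below starts a new section at every H2 or H3 heading. The Section record and HeadingSplitter class are hypothetical names made up for this example, and the sample text is invented. The sketch does not handle headings that appear inside code blocks.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

var markdown = "# Handbook\nIntro text.\n## Leave Policy\nEmployees get leave.\n### Carry-over\nUnused days roll over.\n";
foreach (var s in HeadingSplitter.Split(markdown))
    Console.WriteLine($"[{s.Title}] {s.Text}");

// Hypothetical container: one section title plus the text under it.
record Section(string Title, string Text);

static class HeadingSplitter
{
    // Starts a new section at every "## " or "### " line. Text before the
    // first such heading is kept under an empty title.
    public static List<Section> Split(string markdown)
    {
        var sections = new List<Section>();
        var title = "";
        var body = new StringBuilder();

        void Flush()
        {
            var text = body.ToString().Trim();
            if (text.Length > 0) sections.Add(new Section(title, text));
            body.Clear();
        }

        foreach (var rawLine in markdown.Split('\n'))
        {
            var line = rawLine.TrimEnd('\r');
            bool isHeading = line.StartsWith("## ", StringComparison.Ordinal)
                          || line.StartsWith("### ", StringComparison.Ordinal);
            if (isHeading)
            {
                Flush();                              // close the previous section
                title = line.TrimStart('#').Trim();   // heading text becomes metadata
            }
            else
            {
                body.AppendLine(line);
            }
        }
        Flush();
        return sections;
    }
}
```

Sections that are still too long after this pass can then go through a size-based splitter, which matches the order of Steps 2 and 3 earlier in the lesson.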

Why Metadata Matters

Store department, productLine, effectiveDate on each chunk. Filters narrow retrieval so HR questions do not pull IT outage posts just because both mention "policy."
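
To make the filtering idea concrete, here is a small in-memory sketch. The Chunk record, the sample rows, and the dates are invented for illustration, and the keyword match is only a stand-in for a real vector or full-text search call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented sample data: both chunks mention "policy".
var index = new List<Chunk>
{
    new("Remote work policy: two office days per week.", "HR", "Handbook", new DateTime(2024, 1, 1)),
    new("Outage policy: page the on-call engineer first.", "IT", "Ops", new DateTime(2023, 6, 1)),
};

var asOf = new DateTime(2024, 6, 1);

// Filter on metadata first, then run the (stand-in) text match on what remains.
var hits = index
    .Where(c => c.Department == "HR")
    .Where(c => c.EffectiveDate <= asOf)
    .Where(c => c.Text.Contains("policy", StringComparison.OrdinalIgnoreCase))
    .ToList();

foreach (var hit in hits)
    Console.WriteLine(hit.Text);

// Hypothetical chunk shape using the metadata fields named in the text.
record Chunk(string Text, string Department, string ProductLine, DateTime EffectiveDate);
```

Without the department filter both chunks would match, which is the HR-versus-IT mix-up described above.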

Scanned PDFs Need OCR

Image-only PDFs contain no selectable text until OCR (Optical Character Recognition) extracts words. Azure AI Search skillsets can OCR during indexing. Chunking empty text produces useless embeddings — always verify extracted text quality on a sample page before indexing thousands of scans.
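
One way to spot-check extracted text before indexing is a simple heuristic like the sketch below. The LooksUsable name and both thresholds are assumptions chosen for illustration, not tuned values, and a check like this complements reading a sample page by eye; it does not replace it.

```csharp
using System;
using System.Linq;

Console.WriteLine(TextQuality.LooksUsable("Employees accrue 1.5 vacation days per month of service.")); // True
Console.WriteLine(TextQuality.LooksUsable("~~ .. | ,, ''"));                                              // False
Console.WriteLine(TextQuality.LooksUsable(""));                                                           // False

static class TextQuality
{
    // ASSUMPTION: thresholds are illustrative starting points.
    // A page passes if it has enough visible characters and most of them
    // are letters or digits (OCR noise tends to be punctuation-heavy).
    public static bool LooksUsable(string pageText, int minChars = 40, double minLetterRatio = 0.6)
    {
        if (string.IsNullOrWhiteSpace(pageText)) return false;

        var visible = pageText.Where(c => !char.IsWhiteSpace(c)).ToArray();
        if (visible.Length < minChars) return false;

        double letterRatio = visible.Count(c => char.IsLetterOrDigit(c)) / (double)visible.Length;
        return letterRatio >= minLetterRatio;
    }
}
```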

Keep a spreadsheet during experiments: chunk size, overlap, and hit rate (the share of test questions for which the correct chunk is retrieved) on ten test questions. Small disciplined experiments beat random tweaking. Share results with teammates so everyone stops arguing from gut feeling and starts arguing from numbers, which makes for a healthier engineering culture.
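
The hit rate column can be computed with a few lines of code. The sketch below is a minimal version under stated assumptions: each test question has exactly one chunk id that a person marked as correct, the search function is passed in as a delegate, and the dictionary of canned results is a fake stand-in for a real index. All names and ids are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Fake search results keyed by question (stand-in for a real search call).
var fakeResults = new Dictionary<string, string[]>
{
    ["How many vacation days do I get?"] = new[] { "hr-12", "hr-03" },
    ["Who do I call during an outage?"] = new[] { "hr-07", "it-01" },
};

// Each test case pairs a question with the chunk id marked as correct.
var testSet = new List<(string Question, string ExpectedChunkId)>
{
    ("How many vacation days do I get?", "hr-12"),
    ("Who do I call during an outage?", "it-44"),
};

double rate = Evaluation.HitRate(testSet, q => fakeResults[q], topK: 2);
Console.WriteLine($"Hit rate at top 2: {rate}");

static class Evaluation
{
    // Fraction of questions whose expected chunk appears in the top K results.
    public static double HitRate(
        IEnumerable<(string Question, string ExpectedChunkId)> testSet,
        Func<string, IReadOnlyList<string>> search,
        int topK)
    {
        var cases = testSet.ToList();
        if (cases.Count == 0) return 0.0;

        int hits = cases.Count(t => search(t.Question).Take(topK).Contains(t.ExpectedChunkId));
        return (double)hits / cases.Count;
    }
}
```

Rerunning the same test set after each change to chunk size or overlap gives one comparable number per spreadsheet row.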

Handling Code and Tables

Technical docs mix prose, code blocks, and tables. Keep code blocks intact in single chunks when possible, because splitting mid-function confuses both search and models. Record the programming language as a metadata tag on each chunk so developers can filter API reference material separately from marketing PDFs.
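
A sketch of the "keep code blocks intact" rule for Markdown input: split on blank lines, but never inside a triple-backtick fence. The BlockSplitter name and the sample document are made up for this example, and the sketch assumes fences are always opened and closed with three backticks.

```csharp
using System;
using System.Collections.Generic;

var doc = "Call the API like this:\n\n```python\nresp = client.get('/items')\n\nprint(resp)\n```\n\nThen check the status.";
foreach (var block in BlockSplitter.SplitKeepingCode(doc))
    Console.WriteLine($"--- block ---\n{block}");

static class BlockSplitter
{
    // Splits on blank lines, except inside code fences, so a code block
    // always comes back as a single unit even if it contains blank lines.
    public static List<string> SplitKeepingCode(string markdown)
    {
        var blocks = new List<string>();
        var current = new List<string>();
        bool inFence = false;

        foreach (var rawLine in markdown.Split('\n'))
        {
            var line = rawLine.TrimEnd('\r');

            // A fence line toggles between "inside code" and "outside code".
            if (line.TrimStart().StartsWith("```", StringComparison.Ordinal))
                inFence = !inFence;

            if (!inFence && line.Trim().Length == 0)
            {
                // Blank line outside a fence: the current block is complete.
                if (current.Count > 0) blocks.Add(string.Join("\n", current));
                current.Clear();
            }
            else
            {
                current.Add(line);
            }
        }

        if (current.Count > 0) blocks.Add(string.Join("\n", current));
        return blocks;
    }
}
```

The sample yields three blocks: the intro sentence, the whole Python snippet including its internal blank line, and the closing sentence. A packer can then combine blocks into chunks without ever cutting inside one.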

Tables should include header row text in every chunk derived from the table — otherwise a chunk containing only numeric rows loses column meaning. Prepending 'Table: Server SKUs — columns: Name, RAM, Price' saves retrieval quality on specification sheets.
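
The header-repetition idea can be sketched as follows. The TableChunker name, the row data, and the exact prefix format are assumptions for illustration, and the prefix is only loosely modeled on the example in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented sample table.
var columns = new[] { "Name", "RAM", "Price" };
var rows = new List<string[]>
{
    new[] { "S1", "8 GB", "$40" },
    new[] { "S2", "16 GB", "$75" },
    new[] { "M1", "32 GB", "$140" },
};

foreach (var chunk in TableChunker.ChunkRows("Server SKUs", columns, rows, rowsPerChunk: 2))
    Console.WriteLine(chunk + "\n");

static class TableChunker
{
    // Groups rows into chunks and prepends the caption and column names to
    // every chunk, so no chunk holds bare values without their headers.
    public static List<string> ChunkRows(
        string caption, string[] columns, IReadOnlyList<string[]> rows, int rowsPerChunk)
    {
        if (rowsPerChunk <= 0) throw new ArgumentOutOfRangeException(nameof(rowsPerChunk));

        string columnList = string.Join(", ", columns);
        string header = $"Table: {caption}; columns: {columnList}";

        var chunks = new List<string>();
        for (int i = 0; i < rows.Count; i += rowsPerChunk)
        {
            var lines = rows.Skip(i).Take(rowsPerChunk).Select(r => string.Join(" | ", r));
            chunks.Add(header + "\n" + string.Join("\n", lines));
        }
        return chunks;
    }
}
```

With three rows and two rows per chunk, this produces two chunks, and the second one still carries the column names even though it holds only the M1 row.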

Summary

Chunking is invisible to users but makes or breaks RAG. Split thoughtfully, store metadata, and test retrieval with real questions from Lesson 9.

Frequently Asked Questions

What is a chunk?

A small segment of a document, often a few hundred tokens, stored and searched independently in RAG.

How big should a chunk be?

Common starting point: 500–1000 characters or 256–512 tokens. Tune with real questions.

What is chunk overlap?

Repeating the last few sentences of one chunk at the start of the next so context is not cut mid-thought.

Where should I split a document?

Prefer semantic boundaries, such as headings and paragraphs, over arbitrary page breaks.

Can I chunk documents that contain tables?

Yes, but keep table rows together or add captions so numbers are not orphaned from headers.

What happens if chunks are too large?

Search returns vague blobs; the model gets distracted by irrelevant text in the same chunk.

Key Takeaways

  • Chunking splits large documents into searchable pieces.
  • Size and overlap strongly affect retrieval quality.
  • Split on headings and paragraphs when possible.
  • Metadata (title, page, section) helps filtering and citations.
  • Test chunk settings with real user questions, not guesses.
