Your RAG demo wowed the class. Then a professor asked one tricky question and the bot cited the wrong policy year. How do you know if RAG "works" before embarrassment scales to production?
Evaluation means measuring retrieval quality and answer correctness systematically — not vibes.
What to Measure
- Retrieval — Did the right chunks appear in top results?
- Generation — Is the answer faithful to those chunks?
- User success — Did the human get their task done?
Step-by-Step: Beginner Evaluation Loop
Step 1: Write 30 questions employees or students actually ask.
Step 2: For each, note the correct source document and section.
Step 3: Run retrieval only — check if gold section appears in top 5.
Step 4: Run full RAG — compare answer to expected facts.
Step 5: Score simply: pass / partial / fail.
public record RagTestCase(string Question, string ExpectedFact, string SourceDoc);
// Log retrieval IDs and model answer for each case
Real-World Example
A telecom support team maintains fifty verified Q&A pairs. Each release candidate must score at least 85% pass before deploy. Regressions get caught in CI — like unit tests for answers.
Common Misconceptions
"LLM sounds confident so it is correct." Confidence and accuracy diverge — measure facts.
"One demo question is enough." Edge cases live in boring footnotes — test broadly.
Beginner Metrics Glossary
- Hit rate @5 — percent of questions where the gold chunk appears in top five results.
- Faithfulness — does the answer match retrieved text without adding fiction?
- Answer relevance — does the response actually address the question?
Track hit rate while tuning chunking; track faithfulness while tuning prompts and models.
CI for RAG
Wire your golden test set into a nightly pipeline. Fail the build if hit rate drops below threshold — same discipline as unit tests for code, applied to answers.
Human Review Loop
Sample fifty production questions weekly. Experts score answers 1–5. Track average score over time. Spikes in low scores often trace to a bad document upload or broken indexer — evaluation connects user pain to root cause faster than guessing.
Invite a non-engineer — support agent, teaching assistant — to write golden questions. They know real phrasing users type, including typos and abbreviations engineers forget. Diversity in test questions improves retrieval tuning faster than engineers testing only perfectly typed queries.
Regression Gates
Before merging prompt or chunking changes, run automated eval on golden set in CI. Block merge if hit rate drops more than two percent. Treat RAG config like code — unreviewed portal tweaks cause regressions same as unreviewed code commits.
Share eval dashboards with product owners so quality discussions use numbers. 'Users complained' becomes 'faithfulness dropped from 92% to 81% after indexer change' — actionable, blameless, fixable with engineering precision instead of hallway rumors.
Summary
RAG without evaluation is hope-driven development. Build a golden dataset, score retrieval and grounding, iterate — the same engineering discipline you apply to code tests.
Frequently Asked Questions
Key Takeaways
- Evaluate retrieval and generation separately.
- Build a golden question set from real user queries.
- Track grounding — answers must match retrieved text.
- User feedback complements but does not replace test sets.
- Improvement loops: measure → fix chunking/search → measure again.