8.1.8 RAG Evaluation

Learning Objectives
Section titled “Learning Objectives”By the end of this section, you will be able to:
- Understand why you cannot judge RAG quality from a single demo alone
- Distinguish the different goals of retrieval evaluation and answer evaluation
- Compute simple metrics on a small sample
- Build the engineering habit of “evaluate first, then optimize”
Why does RAG especially need evaluation?
Section titled “Why does RAG especially need evaluation?”Because it is not a single module
Section titled “Because it is not a single module”RAG is not one model. It usually includes at least:
- Document processing
- Retrieval
- Context assembly
- Answer generation
If any step goes wrong, the final answer can get worse.
So you cannot just ask, “Was the answer correct?”
Section titled “So you cannot just ask, “Was the answer correct?””You also need to ask:
- Was the right evidence not retrieved?
- Or was it retrieved but not used well?
- Or was the answer simply poorly written?
That is why RAG evaluation must be viewed in layers.
The first layer: retrieval evaluation
Section titled “The first layer: retrieval evaluation”The most common intuitive metric: Hit@k
Section titled “The most common intuitive metric: Hit@k”Hit@k is very simple:
Did the correct evidence appear in the top k retrieval results?
If the correct evidence for the user’s question is in the top 3, that counts as a hit.
Why is this metric important?
Section titled “Why is this metric important?”Because if the correct material is not retrieved at all, the generation step is almost impossible to get right consistently.
So:
Retrieval evaluation is the foundation of RAG evaluation.
The second layer: answer evaluation
Section titled “The second layer: answer evaluation”Looking only at whether the response sounds fluent is far from enough
Section titled “Looking only at whether the response sounds fluent is far from enough”Answer evaluation should at least consider:
- Whether the answer is correct
- Whether it has evidence behind it
- Whether it is hallucinated
Common dimensions
Section titled “Common dimensions”| Dimension | What it focuses on |
|---|---|
| Correctness | Whether the facts in the answer are correct |
| Faithfulness | Whether it is based on the given materials |
| Relevance | Whether it answers the user’s question |
| Completeness | Whether it includes all key information |
In real business scenarios, different dimensions matter differently.


Good evaluation is a loop, not a one-time test: test set -> retrieval -> answer -> citation -> failure analysis -> fix -> re-evaluate.
A minimal evaluation dataset
Section titled “A minimal evaluation dataset”Below we manually construct a tiny evaluation set.
dataset = [ { "question": "How long after purchase can I request a refund?", "gold_doc": "Refund Policy", "gold_answer": "A refund can be requested within 7 days after course purchase" }, { "question": "How do I get a certificate?", "gold_doc": "Certificate Guide", "gold_answer": "You can get a certificate after completing the project and passing the test" }]
predictions = [ { "retrieved_docs": ["Refund Policy", "Learning Order"], "answer": "A refund can be requested within 7 days after course purchase" }, { "retrieved_docs": ["Learning Order", "Certificate Guide"], "answer": "You can get a certificate after completing the project and passing the test" }]
print(dataset)print(predictions)Expected output:
[{'question': 'How long after purchase can I request a refund?', 'gold_doc': 'Refund Policy', 'gold_answer': 'A refund can be requested within 7 days after course purchase'}, {'question': 'How do I get a certificate?', 'gold_doc': 'Certificate Guide', 'gold_answer': 'You can get a certificate after completing the project and passing the test'}][{'retrieved_docs': ['Refund Policy', 'Learning Order'], 'answer': 'A refund can be requested within 7 days after course purchase'}, {'retrieved_docs': ['Learning Order', 'Certificate Guide'], 'answer': 'You can get a certificate after completing the project and passing the test'}]Read this dataset as two columns of truth and prediction: gold_doc / gold_answer are the reference, while retrieved_docs / answer are what your RAG system produced.
Computing a simple Hit@k
Section titled “Computing a simple Hit@k”Runnable example
Section titled “Runnable example”dataset = [ { "question": "How long after purchase can I request a refund?", "gold_doc": "Refund Policy" }, { "question": "How do I get a certificate?", "gold_doc": "Certificate Guide" }]
predictions = [ { "retrieved_docs": ["Refund Policy", "Learning Order"] }, { "retrieved_docs": ["Learning Order", "Certificate Guide"] }]
hits = 0for item, pred in zip(dataset, predictions): if item["gold_doc"] in pred["retrieved_docs"]: hits += 1
hit_at_2 = hits / len(dataset)print("Hit@2 =", round(hit_at_2, 4))Expected output:
Hit@2 = 1.0If the correct document appears in the top 2 results for every item, the value is 1.0.
The limitation of this metric
Section titled “The limitation of this metric”It can only tell you whether the right document was retrieved. It cannot tell you:
- Where it ranked
- Whether the final answer is actually correct
So it is only the first step.
Computing a simple answer accuracy score
Section titled “Computing a simple answer accuracy score”The simplest Exact Match idea
Section titled “The simplest Exact Match idea”In structured short-answer scenarios, you can start with the most straightforward method:
dataset = [ { "gold_answer": "A refund can be requested within 7 days after course purchase" }, { "gold_answer": "You can get a certificate after completing the project and passing the test" }]
predictions = [ { "answer": "A refund can be requested within 7 days after course purchase" }, { "answer": "You can get a certificate after completing the project and passing the test" }]
correct = 0for item, pred in zip(dataset, predictions): if item["gold_answer"] == pred["answer"]: correct += 1
exact_match = correct / len(dataset)print("Exact Match =", round(exact_match, 4))Expected output:
Exact Match = 1.0But real-world scenarios are often not that simple
Section titled “But real-world scenarios are often not that simple”The same correct answer can have many different phrasings. So online systems often also introduce:
- Semantic matching
- LLM-as-a-judge
- Manual sampling and review
Faithfulness: is the answer supported by evidence?
Section titled “Faithfulness: is the answer supported by evidence?”This is more important than “does it sound plausible?”
Section titled “This is more important than “does it sound plausible?””An answer may read very fluently, but if it was not derived from the retrieved materials, the risk is high.
A simplified checking idea
Section titled “A simplified checking idea”The example below is very rough, but it helps you understand the concept of whether an answer is supported by evidence.
evidence = "A refund can be requested within 7 days after course purchase"answer = "A refund can be requested within 7 days after course purchase"
faithful = answer in evidence or evidence in answerprint("Supported by evidence:", faithful)Expected output:
Supported by evidence: True
Real systems of course do not rely only on string matching, but the idea is correct:
The answer should be supported by the retrieval evidence as much as possible.

How should the evaluation set be built?
Section titled “How should the evaluation set be built?”Minimal usable evaluation set
Section titled “Minimal usable evaluation set”It should include at least:
- The user question
- The reference answer
- The correct evidence document or evidence snippet
The evaluation set should cover different difficulty levels
Section titled “The evaluation set should cover different difficulty levels”For example:
- Simple factual Q&A
- Questions with paraphrased wording
- Cross-paragraph questions
- Easy-to-confuse questions
If the evaluation set is too narrow, optimization results can be misleading.
Online evaluation is also important
Section titled “Online evaluation is also important”Offline evaluation cannot represent everything
Section titled “Offline evaluation cannot represent everything”No matter how good the offline dataset is, it cannot fully cover real user questions.
Common online signals
Section titled “Common online signals”For example:
- Follow-up question rate
- User correction rate
- Likes / dislikes
- Manual quality inspection samples
A mature RAG system is usually evaluated with both offline evaluation and online feedback.
If your goal is a “knowledge-base-driven SOP document assistant,” what should the evaluation set focus on?
Section titled “If your goal is a “knowledge-base-driven SOP document assistant,” what should the evaluation set focus on?”This kind of project is not quite the same as a normal Q&A system. You are not only concerned with whether the answer sounds right. You also need to care about:
- Whether the policy materials were retrieved correctly
- Whether the handled cases were selected correctly
- Whether the checklist and final section were placed in the right position
- Whether the source can be traced back
So the evaluation table that fits this kind of project usually needs at least one more layer:
| Dimension | What it is closer to checking |
|---|---|
| Policy hit | Whether the official policy materials for this topic were found |
| Case retrieval | Whether suitable handled cases were found |
| Structural correctness | Whether policies, cases, and checklists were placed in the right sections |
| Source completeness | Whether the final output can be traced back to the original materials |
You can think of it like this:
Evaluation for an SOP document project is not just “answering correctly,” but “finding correctly, placing correctly, and citing correctly.”
A minimal evaluation example that looks more like an SOP document project
Section titled “A minimal evaluation example that looks more like an SOP document project”dataset = [ { "topic": "Refund escalation", "gold_policies": ["Duplicate billing refunds must be escalated with transaction evidence."], "gold_cases": ["A duplicate charge after failed checkout was escalated to billing with payment evidence."], }]
prediction = { "policies": ["Duplicate billing refunds must be escalated with transaction evidence."], "cases": ["A duplicate charge after failed checkout was escalated to billing with payment evidence."], "source_refs": [{"doc_id": "refund_policy_001", "page_or_slide": 3}],}
print(dataset[0])print(prediction)Expected output:
{'topic': 'Refund escalation', 'gold_policies': ['Duplicate billing refunds must be escalated with transaction evidence.'], 'gold_cases': ['A duplicate charge after failed checkout was escalated to billing with payment evidence.']}{'policies': ['Duplicate billing refunds must be escalated with transaction evidence.'], 'cases': ['A duplicate charge after failed checkout was escalated to billing with payment evidence.'], 'source_refs': [{'doc_id': 'refund_policy_001', 'page_or_slide': 3}]}This example is very small, but it helps beginners build the right evaluation intuition:
- The final object of evaluation is often not a single sentence answer
- It is an entire structured result
Evidence to Keep
Section titled “Evidence to Keep”Keep this page’s proof of learning as a small evidence card:
- Query
- one user question or test case
- Retrieved Chunks
- chunk ids, scores, and source titles
- Answer
- final response with citation or source note
- Failure Check
- missing evidence, wrong chunk, stale doc, or unsupported claim
- Next Action
- chunking, embedding, reranking, prompt, or eval change
Common beginner mistakes
Section titled “Common beginner mistakes”Only looking at one or two successful cases
Section titled “Only looking at one or two successful cases”A demo can be encouraging, but it cannot replace evaluation.
Evaluating only the answer, not the retrieval
Section titled “Evaluating only the answer, not the retrieval”Then it becomes hard to locate which layer caused the problem.
Changing the system without a fixed evaluation set
Section titled “Changing the system without a fixed evaluation set”Without a fixed evaluation set, it is hard to tell whether you improved the system or just saw random fluctuation.
RAG project evaluation metrics summary
Section titled “RAG project evaluation metrics summary”When working on a RAG project, do not just look at whether the answer sounds right. A more reliable approach is to split evaluation into four layers: retrieval, generation, citation, and system.
| Layer | Metric | Description |
|---|---|---|
| Retrieval layer | Hit rate, Recall@K, MRR | Whether the correct material was found and how high it ranked |
| Generation layer | Answer accuracy, completeness, consistency | Whether the model answered based on the material and whether it missed key conditions |
| Citation layer | Citation coverage, citation faithfulness | Whether the key conclusions in the answer can be traced to sources |
| System layer | Latency, cost, failure rate | Whether it can serve real users stably and at an acceptable cost |
For a minimal evaluation set, it is recommended to prepare 20–50 questions first, and for each question, write down the reference answer, the document that should be hit, and the key citations. That way, when you optimize chunking, embedding, reranking, or query rewriting, you can tell whether the system truly got better or whether only a few examples happened to look nicer.
Layered failure attribution table
Section titled “Layered failure attribution table”The value of evaluation is not just producing a total score. It is also about helping you know what to fix next. You can put the table below into your experiment log and attribute every failure to a layer first.
- Retrieval layer: the correct document did not enter top-k. Check query, chunking, embedding, and keyword matching; then adjust chunking, add hybrid retrieval, or use query rewrite.
- Context layer: the correct document entered top-k but not the final context. Check packing, deduplication, and length limits; then adjust ranking, compression, or packing strategy.
- Generation layer: the context contains evidence but the answer misses key conditions. Check prompt, answer format, and evidence-following behavior; then require step-by-step evidence-based answering.
- Citation layer: the answer conclusion is correct but the citation does not support it. Check
source_refs, snippets, and answer sentences; then forbid unsupported citations. - Product layer: offline evaluation looks good but users still ask many follow-up questions. Check real question distribution and evaluation-set coverage; then add online questions to the evaluation set.
If you only look at “final answer accuracy,” these issues will be mixed together. Layered attribution makes optimization actions clearer: if retrieval is wrong, do not start by tuning the prompt; if citations are wrong, do not only check whether the answer is fluent.
A reusable evaluation record template
Section titled “A reusable evaluation record template”When working on a RAG project, it is recommended to keep a fixed format for every evaluation round. Even if you start with only a dozen questions, this is much more stable than looking at a single demo.
| Field | Example | Purpose |
|---|---|---|
question | How long after purchase can I request a refund? | User question |
gold_doc | Refund Policy | The material that should be hit |
gold_answer | A refund can be requested within 7 days after course purchase | Reference answer or key fact |
retrieved_docs | Refund Policy; Learning Order | Documents actually retrieved |
answer | Refund can be requested within 7 days | System answer |
citation_ok | true | Whether the citation supports the answer |
failure_type | none / retrieval / generation / citation | Failure attribution |
notes | Correct hit and supported by citation | Manual notes |
A minimal CSV can look like this:
question,gold_doc,gold_answer,retrieved_docs,answer,citation_ok,failure_type,notesHow long after purchase can I request a refund?,Refund Policy,A refund can be requested within 7 days after course purchase,"Refund Policy;Learning Order",A refund can be requested within 7 days after course purchase,true,none,Correct hit and supported by citationHow do I get a certificate?,Certificate Guide,You can get a certificate after completing the project and passing the test,"Learning Order;Certificate Guide",You can get a certificate after completing the project,false,generation,Missing the key condition of passing the testThe key point of this template is not the number of fields, but that each sample can answer three questions: what should have been hit, what was actually hit, and whether the final answer was supported by evidence.
Acceptance rubric for SOP-document RAG
Section titled “Acceptance rubric for SOP-document RAG”If the project goal is to generate SOP documents or internal operating docs, evaluation should not stop at the Q&A level. The rubric below can be used as an acceptance checklist for a portfolio project.
- Practice level: retrieve topic-related materials, generate basic answers or snippets, and display source filenames.
- Project level: retrieve policies, cases, and checklists by topic and content type; organize output into fixed SOP sections; attach sources to key sections.
- Portfolio level: keep a fixed evaluation set and failure samples; explain whether failures came from retrieval, generation, or templates; trace key conclusions back to source text.
- Interview level: compare baseline, hybrid retrieval, and reranking; explain quality, cost, and latency trade-offs; spot-check citation authenticity and record improvements.
You can put this table directly into your project README. It shows that you did not just build a “can answer questions” demo, but are evaluating a knowledge-base-driven system with an engineering mindset.
Summary
Section titled “Summary”The most important takeaway from this section is:
RAG evaluation is not a nice-to-have; it is the steering wheel of system iteration.
Without evaluation, you can only optimize by feel. With evaluation, you can know where the problem is and whether a change truly brought improvement.
Exercises
Section titled “Exercises”- Add 3 more questions to the evaluation set, and manually write
gold_docandgold_answerfor them. - Modify
predictionsso that one answer is wrong on purpose, then recompute Hit@k and Exact Match. - Think about this: if Hit@k is very high but the final answer is still often wrong, which layer is the problem more likely to be in?
Reference implementation and walkthrough
- Good evaluation questions should include easy direct lookup, synonym wording, permission-sensitive cases, and at least one confusing near match.
gold_docandgold_answershould be written before looking at model output. - Hit@k measures whether the right evidence was retrieved; Exact Match measures whether the final answer matches the expected answer. A deliberately wrong answer should lower answer metrics even if retrieval remains correct.
- High Hit@k with wrong answers usually points to the generation, context-packing, citation, or answer-verification layer rather than the first-stage retriever.