
9.8.3 Agent Benchmarking

Agent benchmark and custom eval set comparison

  • Understand the value and limitations of general benchmarks
  • Know why business Agents must have custom evaluation sets
  • Be able to design a small project benchmark
  • Avoid ignoring real tasks just to chase leaderboard scores

The purpose of a benchmark is to provide a fixed set of tasks so different models or systems can be compared. For example, a coding Agent can be evaluated on bug-fixing ability, a web Agent can be evaluated on browser operation ability, and a tool Agent can be evaluated on multi-step tool use.

flowchart LR
A[Fixed task set] --> B[Unified scoring]
B --> C[Compare models or systems]
C --> D[Find capability boundaries]

A benchmark's value lies in being repeatable, comparable, and useful for tracking trends across versions. A high score, however, does not necessarily carry over to your real business use case.

| Type | Evaluation Focus | Typical Tasks |
| --- | --- | --- |
| Code-based | Modify code, fix tests, understand repositories | Fix issues, pass unit tests |
| Web-based | Browse webpages, fill forms, find information | Multi-step browser tasks |
| Tool-calling | Choose tools, generate parameters, handle results | API calls, function composition |
| Long-horizon tasks | Plan, execute, recover, summarize | Research, analysis, report generation |

When learning these benchmarks, the key is not to memorize the names, but to understand how they define tasks, inputs, scoring, and failures.
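One way to practice reading a benchmark this way is to write a single task down as a record with those four parts. The sketch below is only illustrative: the `BenchmarkTask` class, its field names, and the example values are assumptions made for this page, not the schema of any published benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkTask:
    """Hypothetical record for one benchmark task (field names are illustrative)."""
    task_id: str
    instruction: str       # what the Agent is asked to do
    inputs: dict           # what the Agent is given: repository, URL, tools, documents
    success_check: str     # how a run is scored
    failure_modes: list = field(default_factory=list)  # failures worth recording


# Example values are made up to mirror the "Code-based" row of the table.
code_task = BenchmarkTask(
    task_id="code_001",
    instruction="Fix the failing unit test in the repository",
    inputs={"repo": "example-repo", "test_command": "pytest"},
    success_check="all unit tests pass",
    failure_modes=["edits unrelated files", "tests still fail", "exceeds step budget"],
)


def describe(task: BenchmarkTask) -> str:
    """Summarize a task as 'id: instruction -> scoring rule'."""
    return f"{task.task_id}: {task.instruction} -> scored by '{task.success_check}'"


print(describe(code_task))
```

Filling in this record for a benchmark you are studying quickly shows which of the four parts it specifies precisely and which it leaves vague.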

Why You Still Need a Custom Project Evaluation Set

General benchmarks cannot cover your course docs, tool permissions, user goals, or business constraints. For example, your “AI learning assistant” needs to answer course questions, generate study plans, cite chapter sources, and avoid inventing course content. All of these must be tested with your own evaluation set.

A custom evaluation set should include at least 20 samples: 10 normal tasks, 5 boundary cases, 3 tool failure cases, and 2 safety or permission cases. Each sample should have clear success criteria.

You can organize the 20 samples like this:

| Group | Count | Example |
| --- | --- | --- |
| Normal tasks | 10 | Generate a study plan, answer a chapter question, summarize a concept |
| Boundary tasks | 5 | User asks vaguely, mixes multiple stages, or uses an incorrect chapter name |
| Tool failure tasks | 3 | Search returns empty, API timeout, document parser fails |
| Safety / permission tasks | 2 | User asks the Agent to delete files or send content without confirmation |

This distribution prevents a common beginner mistake: testing only the happy path.
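The distribution can also be enforced mechanically, so a happy-path-only set is caught before any Agent is run. The sketch below assumes each sample carries a `group` label. That field and the group names (`normal`, `boundary`, `tool_failure`, `safety`) are hypothetical conventions chosen for this example.

```python
from collections import Counter

# Minimum counts per group, following the 10 / 5 / 3 / 2 split described above.
# The group names are an assumed labeling convention, not part of the course material.
REQUIRED_MINIMUMS = {"normal": 10, "boundary": 5, "tool_failure": 3, "safety": 2}


def check_distribution(samples):
    """Return the groups that fall short of their minimum, with have/need counts."""
    counts = Counter(sample["group"] for sample in samples)
    return {
        group: {"have": counts.get(group, 0), "need": minimum}
        for group, minimum in REQUIRED_MINIMUMS.items()
        if counts.get(group, 0) < minimum
    }


# A happy-path-only evaluation set: 20 normal tasks and nothing else.
happy_path_only = [{"id": f"case_{i:03d}", "group": "normal"} for i in range(20)]

print(check_distribution(happy_path_only))
```

An empty result means every group meets its minimum. For the happy-path-only set, the three non-normal groups are all reported as missing.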

{
  "id": "course_agent_008",
  "task": "Help me create a one-week RAG study plan and cite the course entry point",
  "expected_capabilities": ["retrieve course docs", "generate a plan", "provide sources"],
  "must_include": ["RAG basics", "retrieval strategy", "RAG evaluation"],
  "must_not_do": ["invent non-existent chapters", "call the write-file tool"],
  "scoring": {
    "coverage": 2,
    "source_accuracy": 2,
    "plan_quality": 1
  }
}

This example is more actionable than simply asking whether the answer is satisfactory, because it clearly defines what must be included, what must not be done, and how to score it.
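The `scoring` field can be turned into a number once each criterion has been rated. The sketch below assumes that the values in `scoring` are the maximum points per criterion and that a reviewer (or an automated check) rates each criterion between 0 and 1. The source sample does not spell out either convention, so treat both as assumptions. The ratings are made up.

```python
# Rubric weights copied from the "scoring" field of the sample.
# Assumption: each value is the maximum number of points for that criterion.
SCORING_WEIGHTS = {"coverage": 2, "source_accuracy": 2, "plan_quality": 1}


def weighted_score(ratings, weights=SCORING_WEIGHTS):
    """Combine per-criterion ratings in [0, 1] into earned and maximum points."""
    unknown = set(ratings) - set(weights)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    # A criterion with no rating earns zero points.
    earned = sum(weight * ratings.get(name, 0.0) for name, weight in weights.items())
    return {"earned": earned, "max": sum(weights.values())}


# Made-up ratings: full coverage, half of the citations correct, a good plan.
ratings = {"coverage": 1.0, "source_accuracy": 0.5, "plan_quality": 1.0}

print(weighted_score(ratings))  # {'earned': 4.0, 'max': 5}
```

Partial credit on `source_accuracy` costs one of the five points here, which makes a citation problem visible even when the plan itself reads well.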

A benchmark becomes useful only when you can run the same cases again after changing a Prompt, model, tool schema, or retrieval strategy.

Here is a very small scoring example:

sample = {
    "id": "course_agent_008",
    "must_include": ["RAG basics", "retrieval strategy", "RAG evaluation"],
    "must_not_do": ["invent non-existent chapters", "call the write-file tool"],
}

answer = """
This one-week plan covers RAG basics, retrieval strategy, and RAG evaluation.
It cites the course RAG entry chapter and returns the plan as text only.
"""


def score_answer(sample, answer):
    """Score an answer by case-insensitive substring matching."""
    answer_lower = answer.lower()
    # Count the required phrases that appear in the answer.
    include_hits = sum(item.lower() in answer_lower for item in sample["must_include"])
    # Literal matching only catches a forbidden phrase that appears verbatim.
    # Forbidden actions such as tool calls must be checked against the trace.
    forbidden_hits = sum(item.lower() in answer_lower for item in sample["must_not_do"])
    return {
        "coverage": include_hits / len(sample["must_include"]),
        "forbidden_violations": forbidden_hits,
        "pass": include_hits == len(sample["must_include"]) and forbidden_hits == 0,
    }


print(score_answer(sample, answer))

Expected output:

{'coverage': 1.0, 'forbidden_violations': 0, 'pass': True}

This is deliberately simple. In a real Agent benchmark, you would also inspect:

  • Whether cited chapters actually exist
  • Whether the Agent used allowed tools only
  • Whether it recovered from empty retrieval results
  • Whether it asked for confirmation before risky actions
  • Whether latency and cost stayed within acceptable limits
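Several of these checks need the execution trace rather than the answer text. A minimal sketch follows, assuming the trace is available as a list of tool-call records. The record shape, the tool names, and the chapter list are all hypothetical placeholders for your own project data.

```python
# Hypothetical project data: chapter and tool names are illustrative only.
KNOWN_CHAPTERS = {"RAG basics", "Retrieval strategy", "RAG evaluation"}
ALLOWED_TOOLS = {"search_course_docs", "read_chapter", "write_file"}
NEEDS_CONFIRMATION = {"write_file"}  # allowed, but only after the user confirms


def check_trace(trace, cited_chapters):
    """Inspect one run. `trace` is a list of {"tool": str, "confirmed": bool} steps."""
    used_tools = {step["tool"] for step in trace}
    unconfirmed = sorted(
        step["tool"]
        for step in trace
        if step["tool"] in NEEDS_CONFIRMATION and not step.get("confirmed", False)
    )
    return {
        "unknown_citations": sorted(set(cited_chapters) - KNOWN_CHAPTERS),
        "disallowed_tools": sorted(used_tools - ALLOWED_TOOLS),
        "unconfirmed_risky_calls": unconfirmed,
    }


# A made-up run that breaks all three rules.
trace = [
    {"tool": "search_course_docs"},
    {"tool": "write_file", "confirmed": False},
    {"tool": "delete_file"},
]
report = check_trace(trace, cited_chapters=["RAG basics", "Advanced RAG tuning"])

print(report)
```

Each non-empty list in the report is a failure that a text-only scorer would miss: an invented chapter, a tool outside the allowlist, and a risky call made without confirmation.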

Benchmarks are easy to overfit. A system may perform very well on fixed tasks, but become unstable when given real user input. Benchmarks may also ignore cost, latency, safety, and maintainability. For Agents, whether the execution trace is explainable is sometimes more important than the final score.

Start with general benchmarks to build intuition about capability, then use a custom evaluation set to validate project quality. Every time you change the Prompt, switch models, modify the tool schema, or add a retrieval strategy, run the same evaluation set again. That way, you can tell whether the change improved performance, made it worse, or only changed the output style.
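Comparing two runs case by case shows more than comparing their pass rates. The sketch below assumes each run is stored as a mapping from case id to pass/fail. The case ids and results are made up for illustration.

```python
def compare_runs(baseline, candidate):
    """Compare per-case pass/fail results of two runs over the same case ids."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")
    return {
        "baseline_pass_rate": sum(baseline.values()) / len(baseline),
        "candidate_pass_rate": sum(candidate.values()) / len(candidate),
        # Cases that passed before the change and fail after it.
        "regressions": sorted(c for c in baseline if baseline[c] and not candidate[c]),
        # Cases that failed before the change and pass after it.
        "fixes": sorted(c for c in baseline if not baseline[c] and candidate[c]),
    }


# Made-up results before and after a Prompt change.
baseline = {
    "course_agent_001": True,
    "course_agent_002": False,
    "course_agent_003": True,
    "course_agent_004": True,
}
candidate = {
    "course_agent_001": True,
    "course_agent_002": True,
    "course_agent_003": False,
    "course_agent_004": True,
}

print(compare_runs(baseline, candidate))
```

In this example both runs pass 3 of 4 cases, yet one case was fixed and another regressed. The aggregate score alone would hide that trade.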

Keep this page’s proof of learning as a small evidence card:

  • Eval Cases: fixed tasks and expected safe behavior
  • Scorecard: task success, tool correctness, trace quality, safety
  • Guardrail: policy, permission, validation, or human confirmation
  • Failure Check: unsafe tool use, prompt injection, hidden state, or unobserved action
  • Next Action: add case, guardrail, log, rollback, or refusal path

The first mistake is treating benchmark scores as production quality. The second is only testing normal tasks and not testing failures or boundary cases. The third is having too few evaluation samples and judging the system based on a few demos. The fourth is not saving historical results, which makes version comparison impossible.
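The fourth mistake is cheap to avoid: append every run to a history file. The sketch below uses one JSON record per line. The file name `eval_history.jsonl`, the record fields, and the version labels are assumptions for this example, and the demo writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def append_result(history_path, version, results):
    """Append one run (case id -> pass/fail) to a JSON Lines history file."""
    record = {
        "version": version,
        "passed": sum(results.values()),
        "total": len(results),
        "results": results,
    }
    with open(history_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_history(history_path):
    """Read all recorded runs back, oldest first."""
    with open(history_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with made-up results in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_history.jsonl"
    append_result(path, "prompt-v1", {"course_agent_001": True, "course_agent_002": False})
    append_result(path, "prompt-v2", {"course_agent_001": True, "course_agent_002": True})
    history = load_history(path)

for record in history:
    print(record["version"], f'{record["passed"]}/{record["total"]}')
```

Keeping the per-case results, not just the totals, is what makes later version comparison possible.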

  1. Design 20 benchmark samples for your course Q&A assistant.
  2. Write must_include, must_not_do, and scoring rules for each sample.
  3. Design 3 tool failure scenarios, such as empty retrieval results, API timeout, or insufficient permissions.
  4. Explain why benchmarks cannot replace production monitoring.

After completing this section, you should be able to explain the difference between general benchmarks and custom evaluation sets, design a small benchmark for your own Agent project, and use a fixed evaluation set to compare the effectiveness of different models, Prompts, and tool designs.

Project reference and review notes
  1. A solid 20-sample benchmark should mix easy, medium, and hard course questions, include citation-required cases, include out-of-scope questions, and include questions where retrieval returns partial or conflicting evidence.
  2. must_include should list required concepts or evidence, must_not_do should block hallucinated citations or unsafe actions, and scoring rules should explain how to assign partial credit.
  3. Tool failure scenarios should test empty retrieval, timeout, permission denial, malformed tool output, and stale data. The expected behavior is graceful recovery or a clear stop, not confident guessing.
  4. Benchmarks cannot replace production monitoring because real users create new wording, new goals, latency pressure, cost spikes, data drift, and tool failures that the fixed benchmark never anticipated.
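The tool failure scenarios in note 3 can be rehearsed with stub tools before wiring in real ones. The sketch below is a toy: `answer_with_retrieval`, the stub search functions, and the status labels are hypothetical names invented here. It covers only three of the listed failures (empty retrieval, timeout, permission denial).

```python
def answer_with_retrieval(question, search):
    """Toy Agent step: call the search tool and stop clearly when it fails."""
    try:
        hits = search(question)
    except TimeoutError:
        return {"status": "stopped", "message": "Search timed out; please retry later."}
    except PermissionError:
        return {"status": "stopped", "message": "No permission to search these docs."}
    if not hits:
        # No evidence: say so instead of guessing an answer.
        return {"status": "no_evidence", "message": "No course material matched."}
    return {"status": "answered", "message": f"Based on: {hits[0]}"}


# Stub tools that simulate each failure.
def empty_search(question):
    return []


def timeout_search(question):
    raise TimeoutError("simulated timeout")


def denied_search(question):
    raise PermissionError("simulated permission denial")


# Each case pairs a failing tool with the status the Agent is expected to return.
FAILURE_CASES = {
    "empty_retrieval": (empty_search, "no_evidence"),
    "api_timeout": (timeout_search, "stopped"),
    "permission_denied": (denied_search, "stopped"),
}

outcomes = {}
for name, (tool, expected_status) in FAILURE_CASES.items():
    result = answer_with_retrieval("What is RAG?", tool)
    outcomes[name] = result["status"] == expected_status

print(outcomes)
```

A real Agent would replace the toy function, but the case table stays the same: one injected failure, one expected behavior, and no credit for a confident answer built on nothing.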