Skip to content

9.2.7 Reasoning Evaluation and Optimization

  • Understand why reasoning systems cannot be judged only by final answer accuracy
  • Understand why process quality, tool efficiency, and cost also need evaluation
  • Learn common reasoning metrics through a runnable example
  • Learn how to optimize in a targeted way based on evaluation results

Why is reasoning evaluation more complex than ordinary QA?

Section titled “Why is reasoning evaluation more complex than ordinary QA?”

Because a reasoning system does not just output a single text

Section titled “Because a reasoning system does not just output a single text”

A reasoning Agent often also produces:

  • multi-step trace
  • tool call records
  • intermediate states
  • final answer

So this is not a single-point output problem, but a process-oriented system problem.

Looking only at final accuracy misses a lot of information

Section titled “Looking only at final accuracy misses a lot of information”

For example, two systems are both 80% correct:

  • System A averages 3 steps and almost never uses tools unnecessarily
  • System B averages 9 steps, often repeats searches, and has double the cost

If you only look at accuracy, you may think they are about the same; but in engineering terms, they are at completely different levels.

An analogy: not only whether you reached the destination, but also how you got there

Section titled “An analogy: not only whether you reached the destination, but also how you got there”

If two cars both reach the destination:

  • one drives there smoothly
  • the other takes a winding route, brakes hard, and nearly causes an accident

You would not say they performed the same. The same applies to reasoning systems.


The four most common types of metrics for reasoning systems

Section titled “The four most common types of metrics for reasoning systems”

The most common ones are:

  • answer accuracy
  • exact match
  • pass rate

These answer:

  • Is the final conclusion correct?

For example:

  • Were any critical steps missed?
  • Are there contradictions?
  • Is there any useless looping?

These answer:

  • Is the process reliable?

For example:

  • tool success rate
  • repeated call rate
  • unnecessary call rate

These answer:

  • Are tools being used properly?

For example:

  • average number of steps
  • average latency
  • average token cost

These answer:

  • Is the system worth deploying?

First, run a truly useful evaluation script

Section titled “First, run a truly useful evaluation script”

The code below compares the trace quality of two agents. It counts:

  • final answer accuracy
  • average number of steps
  • tool success rate
  • repeated tool call rate
agent_a = [
{
"id": "case_1",
"expected": "59",
"final_answer": "59",
"trace": [
{"tool": "calculator", "ok": True},
],
},
{
"id": "case_2",
"expected": "3-7 working days",
"final_answer": "3-7 working days",
"trace": [
{"tool": "search_policy", "ok": True},
],
},
]
agent_b = [
{
"id": "case_1",
"expected": "59",
"final_answer": "59",
"trace": [
{"tool": "search_policy", "ok": True},
{"tool": "calculator", "ok": True},
{"tool": "calculator", "ok": True},
],
},
{
"id": "case_2",
"expected": "3-7 working days",
"final_answer": "5-10 working days",
"trace": [
{"tool": "search_policy", "ok": False},
{"tool": "search_policy", "ok": True},
],
},
]
def evaluate_agent(cases):
accuracy = sum(case["final_answer"] == case["expected"] for case in cases) / len(cases)
avg_steps = sum(len(case["trace"]) for case in cases) / len(cases)
tool_calls = [item for case in cases for item in case["trace"]]
tool_success = sum(item["ok"] for item in tool_calls) / len(tool_calls)
repeated_tool_calls = 0
for case in cases:
tools = [item["tool"] for item in case["trace"]]
repeated_tool_calls += len(tools) - len(set(tools))
repeated_rate = repeated_tool_calls / len(cases)
return {
"accuracy": round(accuracy, 3),
"avg_steps": round(avg_steps, 3),
"tool_success": round(tool_success, 3),
"repeated_tool_calls_per_case": round(repeated_rate, 3),
}
print("agent_a:", evaluate_agent(agent_a))
print("agent_b:", evaluate_agent(agent_b))

Expected output:

Terminal window
agent_a: {'accuracy': 1.0, 'avg_steps': 1.0, 'tool_success': 1.0, 'repeated_tool_calls_per_case': 0.0}
agent_b: {'accuracy': 0.5, 'avg_steps': 2.5, 'tool_success': 0.8, 'repeated_tool_calls_per_case': 1.0}

What is the most important takeaway from this code?

Section titled “What is the most important takeaway from this code?”

The most important thing is not a formula, but the way of thinking it shows:

For the same system, you should at least look at answer quality, process length, and tool performance at the same time.

Only by looking at all three together can you know whether the system is truly stable or just occasionally getting the right answer.

Why does agent_b not necessarily look much worse at first glance, but is actually worse in engineering terms?

Section titled “Why does agent_b not necessarily look much worse at first glance, but is actually worse in engineering terms?”

Because it may have:

  • more steps
  • more repeated tool calls
  • a need for recovery after tool failures

Even if it gets some individual cases right in the end, the cost is still higher.

Why is repeated call rate worth tracking separately?

Section titled “Why is repeated call rate worth tracking separately?”

Because many Agent problems are not “completely incapable,” but rather:

  • not decisive enough
  • repeatedly trying the same tool
  • doing many unnecessary actions

This directly slows the system down and increases cost.

Agent reasoning failure taxonomy diagram


Evaluation should not only ask “Was the answer correct?”

Section titled “Evaluation should not only ask “Was the answer correct?””

For example:

  • math problems
  • rule-based Q&A
  • clear retrieval questions

For process-based tasks, look at whether the steps are reasonable

Section titled “For process-based tasks, look at whether the steps are reasonable”

For example:

  • Were any critical steps missed?
  • Was the conclusion drawn too early?
  • Was it checked before calculating?

For Agent tasks, look at whether the actions are worthwhile

Section titled “For Agent tasks, look at whether the actions are worthwhile”

For example:

  • Were there unnecessary tool calls?
  • Was there an endless loop after tool failure?
  • Did it stop in time when enough information was available?

The timing of stopping is itself part of the capability.


After getting evaluation results, how should you optimize?

Section titled “After getting evaluation results, how should you optimize?”

Prioritize checking:

  • whether the problem was misunderstood
  • whether the tool choice was wrong
  • whether the observations were integrated incorrectly

If accuracy is acceptable but the number of steps is too high

Section titled “If accuracy is acceptable but the number of steps is too high”

Prioritize checking:

  • whether tools are being called repeatedly
  • whether it should stop earlier
  • whether steps can be merged

Prioritize checking:

  • whether the schema is clearly defined
  • whether parameter generation is stable
  • whether observations are sufficiently structured

If performance differs a lot across task types

Section titled “If performance differs a lot across task types”

You should do bucketed analysis by task type. For example:

  • arithmetic tasks
  • policy retrieval tasks
  • multi-constraint planning tasks

Only then can you make targeted improvements.


How should evaluation samples be designed?

Section titled “How should evaluation samples be designed?”

Otherwise, the system will easily look like “it’s all pretty good.” You should deliberately include:

  • questions that are easy to misjudge
  • questions that require multi-step tool coordination
  • questions that can easily lead to endless loops

For example:

  • keeps going when it should stop
  • calls tools randomly when it should not
  • cannot recover after a tool failure

Keep the fixed evaluation set for the long term

Section titled “Keep the fixed evaluation set for the long term”

That way, every time you change the prompt, strategy, or tools, you can make a meaningful before / after comparison.


Misconception 1: If the final answer is correct, the system has no problems

Section titled “Misconception 1: If the final answer is correct, the system has no problems”

Not necessarily. It may simply be:

  • very inefficient in the process
  • too expensive
  • unstable

Misconception 2: More metrics are always better

Section titled “Misconception 2: More metrics are always better”

Metrics are not a collection hobby. What matters is:

  • can the metric explain the problem?
  • can the metric guide optimization?

Misconception 3: You can iterate by instinct even without a fixed benchmark

Section titled “Misconception 3: You can iterate by instinct even without a fixed benchmark”

If you rely only on subjective feeling, it is very easy to make the system more and more uncontrollable.


Keep this page’s proof of learning as a small evidence card:

Task Goal
what the agent is trying to solve
Plan Or Trace
reasoning steps, plan, ReAct trace, or execution graph
Observation
what changed after each action
Failure Check
hallucinated step, stale observation, loop, or unverified conclusion
Eval Action
compare against expected result and revise the plan

The most important thing in this section is not memorizing a few metric names, but building an evaluation loop:

A reasoning system should evaluate final answers, process quality, tool usage, and cost efficiency at the same time, and then make targeted improvements based on the specific weaknesses.

When you truly iterate using this loop, the Agent system can grow from a demo that “sometimes works” into a system that is explainable, improvable, and ready for production.


  1. Add one more case to agent_b in the example and see how the metrics change.
  2. Why is “final answer accuracy” not enough to fully evaluate a reasoning Agent?
  3. Think about this: if your system often calls the same tool repeatedly, which layer would you inspect first?
  4. Design at least 3 core metrics for one of your Agent tasks, and explain why they are valuable.
Reference implementation and walkthrough
  1. Adding a case can change accuracy, tool-use rate, loop count, or failure distribution; inspect which metric moved and why.
  2. Final accuracy misses bad traces: repeated tool calls, unsafe actions, missing citations, excessive cost, and lucky correct answers.
  3. First inspect routing and stopping logic: tool descriptions, observation parsing, loop termination, and retry policy.
  4. Useful metrics include task success, step count, invalid tool-call rate, recovery rate, latency and cost, groundedness, and user-visible error rate.