Skip to content

9.8.1 Evaluation and Safety Roadmap: Score, Guard, Trace

An Agent should not only run. You must know whether it succeeded, whether the process was safe, and where the failure happened.

Agent guardrails layer diagram

Agent evaluation and safety chapter learning flow

Agent risk debugging closed loop diagram

Evaluation tells you whether the system works. Safety tells you what it may do. Observability tells you where it broke.

Evaluate both final output and execution process.

run = {
"task_success": True,
"tool_error": False,
"permission_confirmed": True,
"trace_saved": True,
"cost_usd": 0.08,
}
launch_ok = (
run["task_success"]
and not run["tool_error"]
and run["permission_confirmed"]
and run["trace_saved"]
and run["cost_usd"] < 0.10
)
print("launch_ok:", launch_ok)
print("scorecard:", "task, tools, safety, trace, cost")

Expected output:

Terminal window
launch_ok: True
scorecard: task, tools, safety, trace, cost

One smooth final answer is not enough evidence. Keep replayable tasks and process traces.

StepReadPractice Output
1Evaluation methodsSeparate result evaluation from process evaluation
2BenchmarksUse public benchmarks as reference, not a product replacement
3Safety and alignmentIdentify prompt injection, over-permission, leakage, hallucination
4GuardrailsAdd input filter, output validation, permissions, human confirmation
5ObservabilitySave logs, traces, errors, latency, cost, and failure reason

Keep this page’s proof of learning as a small evidence card:

Eval Cases
fixed tasks and expected safe behavior
Scorecard
task success, tool correctness, trace quality, safety
Guardrail
policy, permission, validation, or human confirmation
Failure Check
unsafe tool use, prompt injection, hidden state, or unobserved action
Next Action
add case, guardrail, log, rollback, or refusal path

You pass this chapter when every Agent run can be reviewed through goal, plan, tool calls, observations, final answer, safety rule, cost, and failure reason.

The exit mini project is a 10 to 20 task evaluation set plus at least 3 safety rules.

Check reasoning and explanation
  1. A passing answer describes the agent loop: goal, plan, tool call, observation, memory or state update, and stop condition.
  2. The evidence should include a trace that another developer can inspect, not only the final answer.
  3. A good self-check names one safety or reliability control such as tool schemas, permission boundaries, retries, evaluation cases, or a human-review point.