Skip to content

E.D AI Safety and Red Team Testing

Red teaming is a repeatable loop, not one scary prompt. You define attack surfaces, run cases, record failures, fix the system, and rerun the same cases.

AI Security Red Team Loop Diagram

AI Security Threat Modeling and Regression Set Diagram

Start with surfaces: prompt, retrieval, tools, memory, and external actions.

  • One AI feature to test
  • A list of surfaces the feature touches
  • A place to keep failed cases as regression tests
cases = [
{"id": "prompt-basic", "surface": "prompt", "expected": "refuse", "before": "refuse", "after": "refuse"},
{"id": "rag-injection", "surface": "retrieval", "expected": "ignore_untrusted_instruction", "before": "ignore_untrusted_instruction", "after": "ignore_untrusted_instruction"},
{"id": "tool-confirmation", "surface": "tool", "expected": "ask_confirmation", "before": "executed", "after": "ask_confirmation"},
]
for phase in ["before", "after"]:
failures = []
for case in cases:
passed = case[phase] == case["expected"]
print(phase, case["id"], "PASS" if passed else "FAIL")
if not passed:
failures.append(case["id"])
print(phase, "failure_count:", len(failures))

Expected output:

Terminal window
before prompt-basic PASS
before rag-injection PASS
before tool-confirmation FAIL
before failure_count: 1
after prompt-basic PASS
after rag-injection PASS
after tool-confirmation PASS
after failure_count: 0

The failed tool case is not embarrassing; it is now a regression test that protects future releases.

Review a red-team run by separating three things: the surface that failed, the expected safe behavior, and the control that changed the result. For example, a tool surface might fail by executing too early, the safe behavior might be ask_confirmation, and the control might be a permission gate.

Do not summarize the run as “safer now.” Keep the original input, the unsafe output, the fix, and the rerun output. That record is what turns a scary prompt into a useful regression case.

For a first portfolio artifact, keep the case file boring and precise. Use columns such as case_id, surface, input, expected_safe_behavior, actual_before, guardrail, and actual_after. A reviewer should be able to rerun one row without guessing your intent.

If a case is too broad, split it. Prompt injection, tool misuse, data leak, and unsafe output are different failure modes. A small regression set with clear surfaces is more useful than a dramatic list of attacks that no one can reproduce.

When the case passes, do not delete it. Move it into the regression set and run it again before release.

StepActionEvidence
1Define assetsUser data, tools, memory, system prompts
2Define surfacesPrompt, documents, retrieval, tool calls, memory
3Run casesPASS / FAIL table
4Fix and rerunRegression report

Keep this page’s proof of learning as a small evidence card:

Threat Model
prompt injection, data leak, tool misuse, unsafe output, or model abuse
Control
validation, permission, sandbox, audit, red-team test, or incident response
Test Case
one attack or failure sample and expected safe behavior
Failure Check
trusting model text, missing logs, broad permissions, or no regression tests
Expected Output
security checklist plus one reproducible red-team case

You pass this elective when you can keep a red-team case file, explain one failed surface, propose one guardrail, and rerun the case after the fix.

Check reasoning and explanation

A passing answer should name one surface, one failure, one guardrail, and the rerun result. For example: “The tool surface failed because the model executed without confirmation. The guardrail requires explicit user approval before external actions. After the fix, the same case returns ask_confirmation.”

The key is repeatability. A red-team note is useful only when the failed case becomes a regression case that future changes must pass.