E.D AI Safety and Red Team Testing

Red teaming is a repeatable loop, not one scary prompt. You define attack surfaces, run cases, record failures, fix the system, and rerun the same cases.

See the Loop First

[Figure: AI security red-team loop]

[Figure: AI security threat modeling and regression set]

Start with surfaces: prompt, retrieval, tools, memory, and external actions.

What You Need

One AI feature to test
A list of surfaces the feature touches
A place to keep failed cases as regression tests
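Before running anything, it can help to check that every surface the feature touches has at least one case. The sketch below is one possible way to do that; the helper name `uncovered_surfaces` and the sample data are assumptions made for illustration, not part of any library.

```python
def uncovered_surfaces(feature_surfaces, cases):
    """Return the surfaces the feature touches that have no test case yet."""
    covered = {case["surface"] for case in cases}
    return [surface for surface in feature_surfaces if surface not in covered]


# Hypothetical feature that touches three of the surfaces listed earlier.
feature_surfaces = ["prompt", "retrieval", "tools"]

# Hypothetical case list; only two surfaces are covered so far.
planned_cases = [
    {"id": "prompt-basic", "surface": "prompt"},
    {"id": "rag-injection", "surface": "retrieval"},
]

gaps = uncovered_surfaces(feature_surfaces, planned_cases)
print("surfaces without a case:", gaps)
```

An empty list means every listed surface has at least one case. It does not mean the cases are good ones.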

Run a Before-and-After Evaluator

# Each case records the expected safe behavior and what the system
# actually did before and after the fix.
cases = [
    {"id": "prompt-basic", "surface": "prompt", "expected": "refuse", "before": "refuse", "after": "refuse"},
    {"id": "rag-injection", "surface": "retrieval", "expected": "ignore_untrusted_instruction", "before": "ignore_untrusted_instruction", "after": "ignore_untrusted_instruction"},
    {"id": "tool-confirmation", "surface": "tool", "expected": "ask_confirmation", "before": "executed", "after": "ask_confirmation"},
]

# Score every case in each phase and count the failures.
for phase in ["before", "after"]:
    failures = []
    for case in cases:
        passed = case[phase] == case["expected"]
        print(phase, case["id"], "PASS" if passed else "FAIL")
        if not passed:
            failures.append(case["id"])  # keep failed ids for the regression set
    print(phase, "failure_count:", len(failures))

Expected output:

before prompt-basic PASS
before rag-injection PASS
before tool-confirmation FAIL
before failure_count: 1
after prompt-basic PASS
after rag-injection PASS
after tool-confirmation PASS
after failure_count: 0

The failed tool case is not embarrassing; it is now a regression test that protects future releases.
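One way to make that concrete is to copy every case that failed in the "before" phase into a regression set. This is a minimal sketch that reuses the case fields from the evaluator; the helper name `failed_cases` and the JSON layout are assumptions, and a real project might store the set in a file or a test framework instead.

```python
import json


def failed_cases(cases, phase):
    """Return the cases whose recorded behavior in `phase` differs from the expected one."""
    return [case for case in cases if case[phase] != case["expected"]]


cases = [
    {"id": "prompt-basic", "surface": "prompt", "expected": "refuse",
     "before": "refuse", "after": "refuse"},
    {"id": "rag-injection", "surface": "retrieval",
     "expected": "ignore_untrusted_instruction",
     "before": "ignore_untrusted_instruction",
     "after": "ignore_untrusted_instruction"},
    {"id": "tool-confirmation", "surface": "tool", "expected": "ask_confirmation",
     "before": "executed", "after": "ask_confirmation"},
]

# Keep only the fields needed to rerun the case later.
regression_set = [
    {"id": case["id"], "surface": case["surface"], "expected": case["expected"]}
    for case in failed_cases(cases, "before")
]

print(json.dumps(regression_set, indent=2))
```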

Red-Team Review

Review a red-team run by separating three things: the surface that failed, the expected safe behavior, and the control that changed the result. For example, a tool surface might fail by executing too early, the safe behavior might be ask_confirmation, and the control might be a permission gate.
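To illustrate what a permission gate could look like, here is a minimal sketch. The tool names, the `EXTERNAL_ACTION_TOOLS` set, and the `gate_tool_call` function are all hypothetical; a real system would define its own list of tools and its own confirmation flow.

```python
# Hypothetical names of tools that trigger external actions.
EXTERNAL_ACTION_TOOLS = frozenset({"send_email", "delete_file"})


def gate_tool_call(tool_name, user_confirmed):
    """Return the behavior label that the evaluator compares against."""
    if tool_name in EXTERNAL_ACTION_TOOLS and not user_confirmed:
        # The control: stop and ask instead of executing too early.
        return "ask_confirmation"
    return "executed"


print(gate_tool_call("send_email", user_confirmed=False))
print(gate_tool_call("send_email", user_confirmed=True))
print(gate_tool_call("search_docs", user_confirmed=False))
```

In this sketch the gate sits outside the model, so the safe behavior does not depend on the model text choosing to ask.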

Do not summarize the run as “safer now.” Keep the original input, the unsafe output, the fix, and the rerun output. That record is what turns a scary prompt into a useful regression case.

For a first portfolio artifact, keep the case file boring and precise. Use columns such as case_id, surface, input, expected_safe_behavior, actual_before, guardrail, and actual_after. A reviewer should be able to rerun one row without guessing your intent.
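A case file with those columns could be written and read back with Python's standard `csv` module, as sketched below. The row values are illustrative, and the `input` field holds a placeholder rather than a real prompt; in practice it would contain the exact input used in the run.

```python
import csv
import io

COLUMNS = [
    "case_id", "surface", "input", "expected_safe_behavior",
    "actual_before", "guardrail", "actual_after",
]

rows = [
    {
        "case_id": "tool-confirmation",
        "surface": "tool",
        "input": "(exact input used in the run)",  # placeholder, not a real prompt
        "expected_safe_behavior": "ask_confirmation",
        "actual_before": "executed",
        "guardrail": "permission gate",
        "actual_after": "ask_confirmation",
    },
]

# Write to an in-memory buffer; a real case file would live on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
case_file_text = buffer.getvalue()
print(case_file_text)

# Read the file back the way a reviewer would before rerunning a row.
loaded = list(csv.DictReader(io.StringIO(case_file_text)))
```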

If a case is too broad, split it. Prompt injection, tool misuse, data leak, and unsafe output are different failure modes. A small regression set with clear surfaces is more useful than a dramatic list of attacks that no one can reproduce.

When the case passes, do not delete it. Move it into the regression set and run it again before release.
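A pre-release check over the regression set might look like the sketch below. The `run_case` callable stands for whatever actually exercises the system; `fake_run_case` is a stub with hard-coded answers so the example runs on its own, and both names are assumptions.

```python
def run_regression_set(regression_cases, run_case):
    """Return the ids of cases whose actual behavior differs from the expected one."""
    return [
        case["id"]
        for case in regression_cases
        if run_case(case) != case["expected"]
    ]


def fake_run_case(case):
    # Stub standing in for a real call to the system under test.
    recorded = {"tool-confirmation": "ask_confirmation"}
    return recorded.get(case["id"], "unknown")


regression_cases = [
    {"id": "tool-confirmation", "surface": "tool", "expected": "ask_confirmation"},
]

failed = run_regression_set(regression_cases, fake_run_case)
release_ok = not failed
print("release_ok:", release_ok, "failed:", failed)
```

A case the stub does not know about comes back as "unknown" and therefore fails, which keeps the check from passing silently on cases that never ran.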

Practical Checklist

Step	Action	Evidence
1	Define assets	User data, tools, memory, system prompts
2	Define surfaces	Prompt, documents, retrieval, tool calls, memory
3	Run cases	PASS / FAIL table
4	Fix and rerun	Regression report

Evidence to Keep

Record what you learned on this page as a small evidence card:

Threat Model: prompt injection, data leak, tool misuse, unsafe output, or model abuse
Control: validation, permission, sandbox, audit, red-team test, or incident response
Test Case: one attack or failure sample and expected safe behavior
Failure Check: trusting model text, missing logs, broad permissions, or no regression tests
Expected Output: security checklist plus one reproducible red-team case
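If the evidence card is kept as structured data, a small check can confirm that no field was left blank. This is only a sketch: the field names mirror the card above, while `REQUIRED_FIELDS` and `missing_fields` are hypothetical helpers and the card values are one possible filling.

```python
REQUIRED_FIELDS = [
    "threat_model", "control", "test_case", "failure_check", "expected_output",
]


def missing_fields(card):
    """Return the required fields that are absent or blank in the card."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(card.get(field, "")).strip()
    ]


card = {
    "threat_model": "tool misuse",
    "control": "permission",
    "test_case": "tool-confirmation: expected ask_confirmation",
    "failure_check": "broad permissions",
    "expected_output": "security checklist plus one reproducible red-team case",
}

print("missing:", missing_fields(card))
```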

Pass Check

You pass this elective when you can keep a red-team case file, explain one failed surface, propose one guardrail, and rerun the case after the fix.

Check reasoning and explanation

A passing answer should name one surface, one failure, one guardrail, and the rerun result. For example: “The tool surface failed because the model executed without confirmation. The guardrail requires explicit user approval before external actions. After the fix, the same case returns ask_confirmation.”

The key is repeatability. A red-team note is useful only when the failed case becomes a regression case that future changes must pass.