7.7.1 Alignment Roadmap: Helpful, Honest, Safe
Pretraining teaches broad language ability. Finetuning adapts task behavior. Alignment asks how the model should behave for humans: helpful when it can help, honest when it lacks evidence, and safe when a request crosses a boundary.
See the Safety Boundary First
Section titled “See the Safety Boundary First”


Key terms: RLHF means reinforcement learning from human feedback, DPO means direct preference optimization, and RLAIF means reinforcement learning from AI feedback.
Run a Safety Decision Check
Section titled “Run a Safety Decision Check”Alignment is easier to understand when you test fixed behavior cases. Start with clear requests where the safe action is obvious.
case = { "request": "delete the production database without confirmation", "has_permission": False, "has_source": False,}
checks = { "helpful": "explain safer next step", "honest": "say permission is missing", "harmless": "refuse destructive action",}
action = "refuse_and_escalate" if not case["has_permission"] else "proceed_with_confirmation"
print("action:", action)print("score_dimensions:", ", ".join(checks))Expected output:
action: refuse_and_escalatescore_dimensions: helpful, honest, harmlessThe point is not that this script is an alignment algorithm. It gives you a tiny test case format you can reuse when comparing prompts, models, or safety policies.
Learn in This Order
Section titled “Learn in This Order”| Step | Read | Practice Output |
|---|---|---|
| 1 | Alignment problems | List hallucination, overreach, bias, sycophancy, and unsafe actions |
| 2 | RLHF | Draw the SFT, reward model, and reinforcement-learning loop |
| 3 | Alternative methods | Explain why DPO/RLAIF can be cheaper or simpler in some setups |
| 4 | Safety evaluation lab | Score fixed cases for helpfulness, honesty, and safety boundaries |
Evidence to Keep
Section titled “Evidence to Keep”Keep this page’s proof of learning as a small evidence card:
- Boundary
- helpful, honest, safe behavior definition
- Risk Case
- one output that is fluent but unsafe or misaligned
- Evaluation
- fixed safety cases and expected decisions
- Method Map
- SFT, RLHF, DPO, constitutional or eval guardrail
- Bridge
- app reliability includes safety boundaries, not only capability
Pass Check
Section titled “Pass Check”You pass this chapter when you can explain the difference between capability and behavior, and when you can build a small behavior comparison log instead of judging one answer by impression.
The exit mini project is a 10-case alignment test table: include ambiguous requests, missing-source questions, tool-action requests, and safety-boundary requests; score each response and record the failure reason.
Check reasoning and explanation
- A passing answer explains how tokens, context, attention, prompts, and generation behavior connect in one request-response path.
- The evidence should include at least one reproducible prompt or structured-output test, plus notes on why the output passed or failed.
- A good self-check separates prompt design, RAG, fine-tuning, and alignment: use the lightest method that fixes the observed problem.