
7.3.1 Transformer Deep Dive Roadmap: Blocks, Masks, Cost

This chapter looks far enough inside the Transformer to help you debug LLM behavior and understand why context length, attention, KV cache, and model variants matter.

Transformer deep-dive chapter relationship diagram

Transformer information flow, computation cost, and task fit diagram

Read the second map as a debugging route: attention flow, compute cost, and task fit should be checked together.

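seq_len = 4
mask = []
for query_pos in range(seq_len):
    row = []
    for key_pos in range(seq_len):
        # A query may see its own position and earlier ones, never later ones.
        row.append("allow" if key_pos <= query_pos else "block")
    mask.append(row)

# One row per query position, one column per key position.
for row in mask:
    print(row)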

Expected output:

['allow', 'block', 'block', 'block']
['allow', 'allow', 'block', 'block']
['allow', 'allow', 'allow', 'block']
['allow', 'allow', 'allow', 'allow']

Causal mask run result map

Generation uses this “no future peeking” rule: a token can attend to itself and to earlier tokens, but not to future tokens.
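To see what the mask does to attention weights, here is a minimal sketch that sets blocked positions to negative infinity before a row-wise softmax. The function name `causal_softmax` and the all-zero scores are illustrative assumptions, not part of any library; real models apply the same idea to learned query-key scores.

```python
import math

def causal_softmax(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    weights = []
    for query_pos, row in enumerate(scores):
        # Future positions get -inf, so exp() turns them into exactly 0.
        masked = [s if key_pos <= query_pos else float("-inf")
                  for key_pos, s in enumerate(row)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up scores: every query scores every key equally.
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_softmax(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row still sums to 1, but the weight is spread only over the allowed positions.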

| Order | Read | What to focus on |
| --- | --- | --- |
| 1 | 7.3.2 Architecture Review | attention, residual, normalization |
| 2 | 7.3.3 Modern Decoder Block | decoder-only LLM block |
| 3 | 7.3.4 Model Variants | encoder, decoder, encoder-decoder |
| 4 | 7.3.5 Efficient Attention | KV cache, MQA/GQA, long context |
| 5 | 7.3.6 Scale and Computation | cost, latency, memory |

Keep this page’s proof of learning as a small evidence card:

Block Contract: [batch, seq, d_model] in and out
Mask Check: causal mask blocks future positions
KV Cache Reason: inference reuses past keys and values
Compute Note: attention cost grows with sequence length
Bridge: these details explain latency and context limits in apps
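The compute note can be made concrete by counting query-key pairs. The sketch below assumes standard full self-attention, where every query scores every allowed key; `attention_score_count` is a hypothetical helper, and the counts are per head and per layer, ignoring all other costs.

```python
def attention_score_count(seq_len, causal=True):
    """Number of query-key pairs scored in one full pass over the sequence."""
    if causal:
        # Query i sees keys 0..i, so the total is 1 + 2 + ... + seq_len.
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len

for seq_len in (1_000, 2_000, 4_000):
    print(seq_len,
          attention_score_count(seq_len, causal=False),
          attention_score_count(seq_len))
# 1000 1000000 500500
# 2000 4000000 2001000
# 4000 16000000 8002000
```

Doubling the context roughly quadruples the number of scores, with or without the causal mask.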

You pass this roadmap when you can explain why decoder-only models need a causal mask, why attention gets expensive as context grows, and why KV cache helps generation.
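One way to rehearse the KV cache part of that explanation is to count how many key/value projections generation needs with and without a cache. This is a toy accounting sketch under simplifying assumptions (one layer, one head, one projection per token position); `generate_projection_counts` is a hypothetical name.

```python
def generate_projection_counts(prompt_len, new_tokens):
    """Count key/value projections during generation, without and with a cache."""
    no_cache = 0
    with_cache = 0
    cached = 0  # positions whose keys and values are already stored
    for step in range(new_tokens):
        context_len = prompt_len + step
        # Without a cache, every position is projected again at every step.
        no_cache += context_len
        # With a cache, only positions not yet stored are projected.
        with_cache += context_len - cached
        cached = context_len
    return no_cache, with_cache

print(generate_projection_counts(prompt_len=100, new_tokens=50))
# (6225, 149)
```

With the cache, the prompt is projected once and each later step adds a single position, which is why reusing past keys and values cuts generation latency at the price of extra memory.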

Check reasoning and explanation
  1. A passing answer explains how tokens, context, attention, prompts, and generation behavior connect in one request-response path.
  2. The evidence should include at least one reproducible prompt or structured-output test, plus notes on why the output passed or failed.
  3. A good self-check separates prompt design, RAG, fine-tuning, and alignment: use the lightest method that fixes the observed problem.