
7.3.1 Transformer Deep Dive Roadmap: Blocks, Masks, Cost

This chapter looks inside the Transformer just deeply enough to debug LLM behavior and to understand why context length, attention, the KV cache, and model variants matter.

Look at the Internal Flow First

[Diagram: Transformer deep-dive chapter relationships]

[Diagram: Transformer information flow, computation cost, and task fit]

Build a Causal Mask

seq_len = 4
mask = []
for query_pos in range(seq_len):
    row = []
    for key_pos in range(seq_len):
        # A query may attend to keys at or before its own position.
        row.append("allow" if key_pos <= query_pos else "block")
    mask.append(row)

for row in mask:
    print(row)

Expected output:

['allow', 'block', 'block', 'block']
['allow', 'allow', 'block', 'block']
['allow', 'allow', 'allow', 'block']
['allow', 'allow', 'allow', 'allow']

[Diagram: causal mask run result as an allow/block grid]

Generation relies on this "no future peeking" rule: a token can attend to itself and to earlier tokens, but never to future tokens.
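As a minimal sketch of how this rule is usually enforced (NumPy here, with random scores purely for illustration; real models add the mask to scaled query-key scores), the "block" entries become -inf before the softmax, so blocked positions receive exactly zero attention weight:

import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))  # stand-in for query-key attention scores

# Additive causal mask: 0 where attention is allowed, -inf where it is blocked.
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
masked = scores + causal_mask

# Softmax over the key axis; the -inf entries become weight 0.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))  # upper triangle is all zeros, each row sums to 1

In a full attention layer, the same mask is simply broadcast across batch and heads.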

Learn in This Order

| Order | Read | What to focus on |
| --- | --- | --- |
| 1 | 7.3.2 Architecture Review | attention, residual, normalization |
| 2 | 7.3.3 Modern Decoder Block | decoder-only LLM block |
| 3 | 7.3.4 Model Variants | encoder, decoder, encoder-decoder |
| 4 | 7.3.5 Efficient Attention | KV cache, MQA/GQA, long context |
| 5 | 7.3.6 Scale and Computation | cost, latency, memory (sketched below) |
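The cost row previews the key arithmetic: self-attention scores every query against every key, so a single attention map has seq_len ** 2 entries. A back-of-envelope count (deliberately ignoring batch, heads, layers, and hidden size) shows why long contexts get expensive:

for seq_len in (1_024, 8_192, 131_072):
    entries = seq_len ** 2  # one score per (query, key) pair
    print(f"{seq_len:>7} tokens -> {entries:>17,} score entries per attention map")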

Pass Check

You pass this roadmap when you can explain why decoder-only models need a causal mask, why attention gets expensive as the context grows, and why the KV cache speeds up generation.
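If the last point feels abstract, here is a minimal counting sketch (plain Python, no real model; the one-unit accounting is an assumption for illustration). Without a cache, each decoding step recomputes keys and values for the entire prefix; with a cache, each step computes them only for the newest token:

def kv_projection_units(num_tokens, use_kv_cache):
    """Count K/V projection work during autoregressive decoding.

    One 'unit' = computing the key and value vectors for one token.
    """
    total = 0
    for step in range(1, num_tokens + 1):
        if use_kv_cache:
            total += 1      # only the newest token; earlier K/V are reused from the cache
        else:
            total += step   # recompute K/V for the whole prefix at every step
    return total

for n in (128, 1_024, 8_192):
    print(n, kv_projection_units(n, use_kv_cache=False), kv_projection_units(n, use_kv_cache=True))

The uncached total grows as n(n+1)/2 while the cached total grows linearly, which is exactly why the KV cache matters for generation; the price is extra memory, a trade-off covered in 7.3.5 and 7.3.6.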