7.4.1 Pretraining Roadmap: Data, Objective, Engineering

Pretraining is how a model first learns broad language patterns. The useful engineering view is a pipeline: clean the data, choose an objective, train at scale, and track risk throughout.

Look at the Pretraining Triangle First

[Figure: triangle diagram relating the three pretraining pieces, data, objective, and engineering]

Piece        First question
Data         What text enters training, and what must be filtered out? (see the sketch below)
Objective    What prediction task creates the learning signal?
Engineering  How are scale, checkpoints, logs, and failures handled?
Evaluation   What can the model do, and where does it fail?
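
To make the data row concrete, here is a minimal sketch of two common cleaning steps: exact-duplicate removal via hashing and a crude length filter. The `clean_corpus` name and the 200-character threshold are illustrative assumptions, not a standard pipeline.

import hashlib

def clean_corpus(documents, min_chars=200):
    """Drop very short documents and exact duplicates (illustrative thresholds)."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        if len(doc) < min_chars:
            continue  # crude quality filter: too short to carry signal
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
    return kept

docs = ["a" * 300, "a" * 300, "too short"]
print(len(clean_corpus(docs)))  # 1: one duplicate and one short doc removed

Real pipelines add near-duplicate detection (for example MinHash) and model-based quality filters, but the shape is the same: each document either passes every gate or is dropped.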

Create Next-Token Pairs

# Pair each token with the token that immediately follows it
tokens = ["AI", "learns", "from", "text"]
pairs = list(zip(tokens[:-1], tokens[1:]))

for source, target in pairs:
    print(f"{source} -> {target}")

Expected output:

AI -> learns
learns -> from
from -> text

This tiny example captures the shape of next-token prediction. Real pretraining repeats it over massive text corpora under careful data governance.
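
The learning signal behind these pairs is cross-entropy: the model scores every vocabulary token, and the loss is the negative log-probability of the true next token. A minimal sketch, assuming a toy four-word vocabulary and hand-picked logits; real training computes this over batches of token IDs inside a framework.

import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the correct next token under softmax(logits)."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    prob_target = exps[target_index] / sum(exps)
    return -math.log(prob_target)

# Toy vocabulary and one (source, target) pair: "AI" -> "learns"
vocab = ["AI", "learns", "from", "text"]
logits_after_AI = [0.1, 2.0, 0.3, -1.0]  # hypothetical model scores
loss = cross_entropy(logits_after_AI, vocab.index("learns"))
print(f"loss = {loss:.3f}")  # lower means the model ranks "learns" higher

Expected output:

loss = 0.324

Training nudges the logits so the correct next token's probability rises, which is exactly what drives this loss down.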

Learn in This Order

Order  Read                           What to focus on
1      7.4.2 Pretraining Data         Sources, filtering, deduplication, contamination
2      7.4.3 Pretraining Methods      Next-token prediction, loss, scaling
3      7.4.4 Pretraining Engineering  Distributed training, checkpoints, monitoring (see the sketch below)
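
Of the engineering topics in row 3, checkpointing is the easiest to preview. A minimal sketch, assuming the training state fits in one pickle file; `save_checkpoint` and `load_latest_checkpoint` are hypothetical helpers, and real systems shard state across workers and write asynchronously.

import pickle
from pathlib import Path

def save_checkpoint(state, step, directory="checkpoints"):
    """Write training state to disk so a crashed run can resume from `step`."""
    Path(directory).mkdir(exist_ok=True)
    path = Path(directory) / f"step_{step:08d}.pkl"  # zero-padded so paths sort by step
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

def load_latest_checkpoint(directory="checkpoints"):
    """Return the most recent checkpoint's state, or None to start fresh."""
    paths = sorted(Path(directory).glob("step_*.pkl"))
    if not paths:
        return None
    with open(paths[-1], "rb") as f:
        return pickle.load(f)

state = {"step": 1000, "weights": [0.1, 0.2]}  # toy stand-in for model state
save_checkpoint(state, step=1000)
print(load_latest_checkpoint()["step"])  # 1000: the run can resume here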

Pass Check

You pass this roadmap when you can explain how data, objective, and engineering each affect the final model, and why contamination can make evaluation misleading.
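
Contamination is checkable with a simple n-gram overlap test between training documents and benchmark items. A minimal sketch, assuming whitespace tokenization; the window size n=8 is an illustrative choice, not a standard.

def ngrams(text, n=8):
    """All length-n windows of whitespace tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc, eval_doc, n=8):
    """Flag an eval example that shares any n-gram with a training document."""
    return bool(ngrams(train_doc, n) & ngrams(eval_doc, n))

print(is_contaminated("the quick brown fox " * 3, "the quick brown fox jumps", n=4))  # True

If a benchmark item leaked into training, the model can answer it from memory, so a high score stops measuring generalization; overlap checks like this are run before evaluation to catch exactly that.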