11.2.4 Fundamentals of Language Models

Learning Objectives
- Understand the most basic task objective of language models
- Understand the continuity between n-gram language models and modern neural language models
- Build intuition for “predicting the next token” through a runnable example
- Understand why language models become the shared foundation of later large models
What exactly does a language model learn?
The most basic form
In one sentence:
- Given the previous text, predict the next token
For example:
- “I love” -> the next word might be AI, you, or Python
Why does this task look simple but powerful?
Because to do it well, the model must gradually learn:
- lexical collocations
- grammatical structure
- common semantic relationships
- some world knowledge
In other words, although “predict the next token” is a simple objective, it pushes the model to learn many language patterns underneath.
An analogy
A language model is like playing a word chain game, but not just any continuation will do. The continuation has to:
- read like human language
- fit the current context
- extend the text in a reasonable way
Start with the n-gram intuition
What is an n-gram language model?
You can first understand it as:
- only looking at a very short history
- using statistical frequency to predict what comes next
For example, a bigram model:
- only looks at the previous 1 word
A trigram model:
- only looks at the previous 2 words
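To make the "short history" idea concrete, here is a minimal sketch that slices one sentence into bigrams and trigrams. The helper name `ngrams` and the example sentence are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is a simplification of real tokenization.
tokens = "I love natural language processing".split()

# Bigram: each predicted word is paired with 1 word of history.
bigrams = ngrams(tokens, 2)
# Trigram: each predicted word is paired with 2 words of history.
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

In each tuple, the last element is the word being predicted and everything before it is the history the model is allowed to see.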
What are the advantages of this method?
- intuitive
- interpretable
- easy to get started with
Its limitations are also obvious
- cannot see long-distance dependencies
- very sparse: most possible word combinations never appear in the training corpus
- weak generalization to word combinations it has not seen
But it is very suitable for helping beginners build the first layer of intuition about language models.
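The sparsity limitation listed above can be made concrete with a small sketch. It assumes the same three toy sentences used in the bigram example on this page and compares how many bigrams are possible with how many were actually observed. The variable names are illustrative.

```python
corpus = ["I love AI", "I love Python", "You love NLP"]

vocab = set()
observed_bigrams = set()
for sent in corpus:
    toks = sent.split()
    vocab.update(toks)
    # Collect each (previous word, next word) pair that actually occurs.
    observed_bigrams.update(zip(toks[:-1], toks[1:]))

# Every ordered pair of vocabulary words is a possible bigram.
possible = len(vocab) ** 2
coverage = len(observed_bigrams) / possible

print(f"vocabulary size:  {len(vocab)}")
print(f"possible bigrams: {possible}")
print(f"observed bigrams: {len(observed_bigrams)}")
print(f"coverage:         {coverage:.1%}")
```

A purely count-based model has no information about any pair outside the observed set. Because the number of possible n-grams grows as the vocabulary size raised to the power n, the gap widens quickly for larger vocabularies and longer histories.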
Run a simple bigram example first
```python
from collections import defaultdict, Counter

corpus = [
    "I love AI",
    "I love Python",
    "You love NLP",
]

# stats[a][b] counts how many times word b directly followed word a.
stats = defaultdict(Counter)
for sent in corpus:
    toks = sent.split()
    # Pair each word with the word that comes right after it.
    for a, b in zip(toks[:-1], toks[1:]):
        stats[a][b] += 1

print(dict(stats))
```
Expected output:
```text
{'I': Counter({'love': 2}), 'love': Counter({'AI': 1, 'Python': 1, 'NLP': 1}), 'You': Counter({'love': 1})}
```
Read this as a tiny next-token table: after I, the next token was love twice; after love, three different next tokens appeared once each.
What is the most important value of this code?
It exposes the lowest-level logic of a language model: after seeing a word, count how many times each possible next word followed it in the training corpus.
Why is this already like a “language model”?
Because it is already doing conditional probability estimation.
For example, after seeing love, the following words AI, Python, and NLP can each have a different probability, estimated from how often each one followed love in the corpus.
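As a minimal sketch of that estimation step, the counts from the toy corpus can be turned into conditional probabilities by dividing each count by the total for its context. The helper name `to_probabilities` is an illustrative assumption, and the counts are copied from the bigram output shown earlier on this page.

```python
from collections import Counter

# Next-word counts copied from the bigram table for the toy corpus.
after_love = Counter({"AI": 1, "Python": 1, "NLP": 1})
after_i = Counter({"love": 2})

def to_probabilities(counter):
    """Relative-frequency estimate of P(next word | previous word)."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}

love_probs = to_probabilities(after_love)
i_probs = to_probabilities(after_i)

print(love_probs)  # each of the three words gets 1/3
print(i_probs)     # "love" gets all of the probability mass
```

With only three sentences the estimates are crude, but the mechanism is the same one a larger count-based model would use.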
How do we move from statistical models to neural language models?
The core task has not changed
Although model architectures become more and more complex later, one important fact remains:
- the objective function is often still “predict the next token”
What changes is the representation and generalization
Neural language models no longer just look up a frequency table, but instead:
- represent tokens as vectors
- model context with neural networks
This allows them to:
- see longer histories
- learn more abstract patterns
- generalize better to unseen combinations
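The following is a rough sketch of why vector representations help with unseen combinations. It is not a neural network: the 2-D vectors are hand-picked for illustration (a real model learns them during training), and the word "adore" and the function `score` are hypothetical additions that do not appear elsewhere on this page.

```python
# Hand-picked 2-D vectors, purely illustrative; a real model learns these.
embedding = {
    "love":  [0.9, 0.1],
    "adore": [0.8, 0.2],  # deliberately placed close to "love"
    "AI":    [1.0, 0.0],
    "the":   [0.0, 1.0],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score(context_word, candidate):
    """Toy compatibility score: dot product of the two token vectors."""
    return dot(embedding[context_word], embedding[candidate])

# A frequency table knows nothing about "adore AI" if it was never observed.
# With vectors, "adore" scores candidates much like its neighbor "love" does.
print(score("love", "AI"), score("love", "the"))
print(score("adore", "AI"), score("adore", "the"))
```

Because "adore" sits near "love" in this toy space, it ranks the candidates in the same order, even though no count for "adore" was ever collected.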
A simplified example of a “prediction distribution”
```python
import math

# Raw scores for each candidate next token; a higher score means more likely.
scores = {
    "AI": 2.0,
    "Python": 1.5,
    "NLP": 0.8,
}

def softmax(score_dict):
    # Exponentiate each score, then normalize so the values sum to 1.
    exps = {k: math.exp(v) for k, v in score_dict.items()}
    total = sum(exps.values())
    return {k: round(v / total, 4) for k, v in exps.items()}

print(softmax(scores))
```
Expected output:
```text
{'AI': 0.5242, 'Python': 0.3179, 'NLP': 0.1579}
```
The model does not have to choose immediately. It first produces a probability distribution, then a decoding rule can choose, sample, or rank candidate next tokens.
This is not a complete neural network, but it already expresses one key idea:
- the model does not output just one word
- it outputs a “probability distribution over the next word”
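Here is a minimal sketch of three simple decoding rules applied to such a distribution. The probabilities are the rounded softmax values from the example above; the fixed seed and the variable names are assumptions made so the sketch is repeatable.

```python
import random

# Rounded probabilities taken from the softmax example.
probs = {"AI": 0.5242, "Python": 0.3179, "NLP": 0.1579}

# Choose: greedy decoding always picks the single most likely token.
greedy = max(probs, key=probs.get)

# Sample: draw tokens in proportion to their probability.
rng = random.Random(0)  # fixed seed, only for repeatability
sampled = rng.choices(list(probs), weights=list(probs.values()), k=5)

# Rank: sort candidates from most to least likely.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

print("greedy: ", greedy)
print("sampled:", sampled)
print("ranked: ", ranked)
```

Greedy decoding returns the same token every time, while sampling can return any candidate with nonzero probability, which is one reason generated text can vary between runs.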
Why do language models become the common foundation of large models?
Because this objective is general enough
Whether you later do:
- conversation
- writing
- code generation
- summarization
many capabilities can grow out of “language continuation ability.”
Because it is well suited to large-scale self-supervised learning
You do not need human annotation for “what the next word is,” because the text itself naturally contains the label.
This means:
- massive text data
- self-supervised training
can be combined naturally.
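A small sketch shows how raw text supplies its own labels. The function name `next_token_examples` is an illustrative assumption, and whitespace splitting again stands in for a real tokenizer.

```python
def next_token_examples(text):
    """Turn one piece of raw text into (context, target) training pairs."""
    tokens = text.split()
    # For each position, the tokens before it are the input
    # and the token at that position is the label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in next_token_examples("I love Python very much"):
    print(context, "->", target)
```

Every sentence yields several training pairs without anyone writing a label by hand, which is why the amount of usable training data grows with the amount of text available.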
This is also why the later path leads to GPT
Because autoregressive language modeling is:
- simple
- unified
- scalable
This path later became one of the main lines of development for large language models.
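To show what "autoregressive" means in practice, here is a minimal sketch that generates text from the toy bigram table: predict one token, append it, then predict again from the extended text. The function `generate` and its `max_new_tokens` parameter are illustrative assumptions, and the greedy rule stands in for whatever decoding rule a real system uses.

```python
from collections import Counter, defaultdict

corpus = ["I love AI", "I love Python", "You love NLP"]

stats = defaultdict(Counter)
for sent in corpus:
    toks = sent.split()
    for a, b in zip(toks[:-1], toks[1:]):
        stats[a][b] += 1

def generate(start, max_new_tokens=5):
    """Greedy autoregressive generation: each new token becomes context."""
    tokens = [start]
    for _ in range(max_new_tokens):
        followers = stats.get(tokens[-1])  # .get avoids creating new keys
        if not followers:
            break  # nothing ever followed this token in the corpus
        # Pick the most frequent follower; ties go to the first one seen.
        tokens.append(followers.most_common(1)[0][0])
    return tokens

print(generate("I"))
```

A bigram model only conditions on the last token, while a neural model conditions on a much longer context, but the loop of predict, append, and repeat has the same shape.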
The most common pitfalls
Misconception 1: a language model is only “good at continuing the next word”
This statement is superficially true, but it underestimates how much this task pushes the model to learn.
Misconception 2: n-gram is useless, so there is no need to learn it
n-gram is very useful, because it lets you see for the first time what a language model is actually doing.
Misconception 3: if it can generate, then it understands language
Strong generation ability does not mean full understanding. That is also why later we still need to look at reasoning, alignment, and tool use.
Evidence to Keep
Keep this page’s proof of learning as a small evidence card:
- Representation: BoW, TF-IDF, static embedding, contextual embedding, or language-model score
- Comparison: nearest text, similarity score, or next-token/log-prob style output
- Interpretation: what the representation captures and what it misses
- Failure Check: polysemy, domain mismatch, short text, tokenization, or semantic drift
- Expected Output: small comparison table with at least one surprising result
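One possible way to produce a log-prob style output for such a card is sketched below, using the toy bigram counts from this page. The function `sentence_log_prob` is an illustrative assumption; it ignores how likely the first word is and applies no smoothing, so any unseen bigram yields negative infinity.

```python
import math
from collections import Counter, defaultdict

corpus = ["I love AI", "I love Python", "You love NLP"]

stats = defaultdict(Counter)
for sent in corpus:
    toks = sent.split()
    for a, b in zip(toks[:-1], toks[1:]):
        stats[a][b] += 1

def sentence_log_prob(sentence):
    """Sum of log P(next | previous); -inf if any bigram was never seen."""
    toks = sentence.split()
    total = 0.0
    for prev, nxt in zip(toks[:-1], toks[1:]):
        followers = stats.get(prev)
        count = followers[nxt] if followers else 0
        if count == 0:
            return float("-inf")
        total += math.log(count / sum(followers.values()))
    return total

for s in ["I love AI", "You love Python", "I love music"]:
    print(f"{s!r}: {sentence_log_prob(s):.4f}")
```

"You love Python" never appears in the corpus, yet it scores the same as "I love AI" because both of its bigrams were observed. "I love music" scores negative infinity because of a single unseen pair. Either observation could serve as the surprising result on the card.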
Summary
The most important thing in this lesson is to form a stable judgment:
The most basic task of a language model is to predict the next token given the previous context; and this seemingly simple objective is exactly what forms the foundation for many capabilities of later large models.
Once this main thread is clear, you will naturally find it much easier to understand GPT, pretraining, and generative models later.
Exercises
- Add a few more sentences to the corpus and see how stats changes.
- Why can we say bigram is simple, yet it already captures the core of a language model?
- Explain in your own words: why is a language model naturally suited to large-scale self-supervised training?
- Think about it: why can the ability to “continue the next word” eventually grow into conversation and writing abilities?
Reference implementation and walkthrough
- Adding corpus sentences changes the transition counts in stats; continuations that appear often become more likely, and rare continuations become comparatively less likely.
- Bigram is simple, but it already contains the core language-modeling idea: estimate what token is likely next from previous context.
- Language modeling is naturally self-supervised because ordinary text already provides input context and the next-token target.
- Next-word prediction grows into writing and conversation when scale, representation learning, instruction tuning, feedback, and long context are added.