
11.2.4 Fundamentals of Language Models

[Figure: language model next-token prediction diagram]

  • Understand the most basic task objective of language models
  • Understand the continuity between n-gram language models and modern neural language models
  • Build intuition for “predicting the next token” through a runnable example
  • Understand why language models become the shared foundation of later large models

In one sentence, a language model does the following:

  • Given the previous text, predict the next token

For example:

  • “I love” -> the next word might be AI, you, or Python

Why does this task look simple but powerful?

Because to do it well, the model must gradually learn:

  • lexical collocations
  • grammatical structure
  • common semantic relationships
  • some world knowledge

In other words, although “predict the next token” is a simple objective, it pushes the model to learn many language patterns underneath.

A language model is like playing a word chain game with one constraint: the continuation cannot be arbitrary. It has to:

  • read like human language
  • fit the current context
  • extend the text in a reasonable way

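Start with the n-gram language model. You can first understand it as:

  • only looking at a very short history
  • using statistical frequency to predict what comes next

For example, a bigram model:

  • only looks at the previous 1 word

A trigram model:

  • only looks at the previous 2 words

Strengths of n-gram models:

  • intuitive
  • interpretable
  • easy to get started with

Limitations of n-gram models:

  • cannot see long-distance dependencies
  • very sparse: most word combinations never appear in the training corpus
  • weak generalization to combinations that were not seen

Despite these limitations, the n-gram model is very suitable for helping beginners build the first layer of intuition about language models.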


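    from collections import defaultdict, Counter

    # Toy training corpus: three short sentences.
    corpus = [
        "I love AI",
        "I love Python",
        "You love NLP",
    ]

    # stats[prev][next] = number of times `next` directly followed `prev`.
    stats = defaultdict(Counter)
    for sent in corpus:
        toks = sent.split()  # whitespace tokenization
        for a, b in zip(toks[:-1], toks[1:]):  # adjacent (previous, next) pairs
            stats[a][b] += 1

    print(dict(stats))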

Expected output:

{'I': Counter({'love': 2}), 'love': Counter({'AI': 1, 'Python': 1, 'NLP': 1}), 'You': Counter({'love': 1})}

Read this as a tiny next-token table: after I, the next token was love twice; after love, three different next tokens appeared once each.
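
The same counting idea extends from bigrams to longer contexts, and doing so makes the sparsity limitation visible. The following is a minimal sketch, not a production implementation: `ngram_counts` is a hypothetical helper name, and the context `("We", "love")` is an invented example of a combination the toy corpus never contains.

```python
from collections import Counter, defaultdict

def ngram_counts(sentences, n):
    """Count which token follows each context of n-1 tokens."""
    counts = defaultdict(Counter)
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            context = tuple(toks[i:i + n - 1])   # the previous n-1 tokens
            counts[context][toks[i + n - 1]] += 1
    return counts

corpus = ["I love AI", "I love Python", "You love NLP"]
bigram = ngram_counts(corpus, 2)    # context = previous 1 word
trigram = ngram_counts(corpus, 3)   # context = previous 2 words

print(bigram[("love",)])                       # three possible continuations
print(trigram[("I", "love")])                  # only two: the context is narrower
print(trigram.get(("We", "love"), Counter()))  # empty: this context was never seen
```

The longer the context, the fewer times each exact context appears in the corpus, which is why a pure counting model has nothing to say about an unseen combination.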

What is the most important value of this code?

It exposes the lowest-level logic of a language model:

  • take the word that was just seen
  • count how many times each possible next word followed it in the training corpus

Why is this already like a “language model”?

Because it is already doing:

  • conditional probability estimation

For example, after seeing:

  • love

the following words:

  • AI
  • Python
  • NLP

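each get their own probability. In this tiny corpus each one followed love exactly once, so each is estimated at 1/3; with more data the estimates would usually differ.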
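
Turning counts into conditional probabilities only takes a normalization step. This is a minimal sketch that assumes the counts for the context love from the bigram table; `to_probs` is a hypothetical helper name, and the "one more sentence" scenario is invented to show how the estimate shifts.

```python
from collections import Counter

# Counts of what followed "love" in the toy corpus.
next_counts = Counter({"AI": 1, "Python": 1, "NLP": 1})

def to_probs(counter):
    """Estimate P(next | previous) as count / total count for that context."""
    total = sum(counter.values())
    return {tok: count / total for tok, count in counter.items()}

probs = to_probs(next_counts)
print(probs)  # each continuation gets 1/3

# Hypothetical: suppose one more sentence ending in "love AI" were added.
more_counts = next_counts + Counter({"AI": 1})
skewed = to_probs(more_counts)
print(skewed)  # AI now gets 0.5, the other two 0.25 each
```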


How do we move from statistical models to neural language models?

Although model architectures become more and more complex later, one important fact remains:

  • the objective function is often still “predict the next token”
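
One common way to turn "predict the next token" into a number that training can minimize is the negative log-probability the model assigned to the token that actually came next. The sketch below assumes a made-up predicted distribution for the context "I love"; the numbers and the name `next_token_loss` are illustrative only.

```python
import math

# Hypothetical model output for the context "I love" (illustrative numbers).
predicted = {"AI": 0.5, "Python": 0.3, "NLP": 0.2}

def next_token_loss(probs, target):
    """Negative log-probability of the token that actually came next."""
    return -math.log(probs[target])

loss_ai = next_token_loss(predicted, "AI")    # model favored this token
loss_nlp = next_token_loss(predicted, "NLP")  # model gave this token less mass

print(round(loss_ai, 3), round(loss_nlp, 3))
```

The more probability the model puts on the true next token, the smaller the loss, so lowering this loss over a large corpus is the same thing as getting better at next-token prediction.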

What changes is the representation and generalization

Neural language models no longer just look up a frequency table, but instead:

  • represent tokens as vectors
  • model context with neural networks

This allows them to:

  • see longer histories
  • learn more abstract patterns
  • generalize better to unseen combinations
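
To make "represent tokens as vectors" concrete, here is a deliberately tiny sketch. It is not a neural network: the 2-dimensional vectors are hand-picked rather than learned, averaging stands in for the network that models context, and all names (`embed`, `output_vecs`, `context_vector`, `scores`) are hypothetical. It only illustrates why similar vectors let a model score a context it has never seen.

```python
# Hand-picked 2-d vectors, purely illustrative (a real model learns these).
embed = {
    "I":    [1.0, 0.0],
    "We":   [0.9, 0.1],   # placed close to "I" on purpose
    "love": [0.0, 1.0],
}
output_vecs = {
    "AI":     [0.8, 0.9],
    "Python": [0.6, 0.8],
    "the":    [0.1, 0.1],
}

def context_vector(tokens):
    """Average the token vectors (a stand-in for a real neural network)."""
    dims = len(next(iter(embed.values())))
    return [sum(embed[t][d] for t in tokens) / len(tokens) for d in range(dims)]

def scores(tokens):
    """Score each candidate next token by a dot product with the context."""
    ctx = context_vector(tokens)
    return {w: sum(c * v for c, v in zip(ctx, vec))
            for w, vec in output_vecs.items()}

seen = scores(["I", "love"])     # a context that appears in the toy corpus
unseen = scores(["We", "love"])  # a context a frequency table has no row for

print(sorted(seen, key=seen.get, reverse=True))
print(sorted(unseen, key=unseen.get, reverse=True))
```

Because "We" sits near "I" in vector space, the unseen context produces nearly the same ranking as the seen one, whereas a frequency table would simply have no entry to look up.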

A simplified example of a “prediction distribution”

    import math

    # Raw scores (logits) for three candidate next tokens.
    scores = {
        "AI": 2.0,
        "Python": 1.5,
        "NLP": 0.8,
    }

    def softmax(score_dict):
        """Turn raw scores into probabilities that sum to 1."""
        exps = {k: math.exp(v) for k, v in score_dict.items()}
        total = sum(exps.values())
        return {k: round(v / total, 4) for k, v in exps.items()}

    print(softmax(scores))

Expected output:

{'AI': 0.5242, 'Python': 0.3179, 'NLP': 0.1579}

The model does not have to choose immediately. It first produces a probability distribution, then a decoding rule can choose, sample, or rank candidate next tokens.

This is not a complete neural network, but it already expresses one key idea:

  • the model does not output just one word
  • it outputs a “probability distribution over the next word”
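
The three decoding options mentioned above (choose, sample, rank) can be sketched directly on that distribution. This is a minimal illustration using the probabilities from the expected output; the fixed seed is an assumption made only so the sampling step is repeatable.

```python
import random

# Distribution copied from the softmax example's expected output.
probs = {"AI": 0.5242, "Python": 0.3179, "NLP": 0.1579}

# Choose: greedy decoding takes the single most probable token.
greedy = max(probs, key=probs.get)

# Rank: order all candidates from most to least probable.
ranked = sorted(probs, key=probs.get, reverse=True)

# Sample: draw tokens in proportion to their probabilities.
rng = random.Random(0)  # seeded only to make the sketch repeatable
sampled = rng.choices(list(probs), weights=list(probs.values()), k=5)

print(greedy)
print(ranked)
print(sampled)
```

Greedy decoding always returns the same token for the same distribution, while sampling can return lower-probability tokens some of the time.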

Why do language models become the common foundation of large models?

Many later tasks can be framed as continuing text:

  • conversation
  • writing
  • code generation
  • summarization

Many of the capabilities behind them can grow out of “language continuation ability.”

Because it is well suited to large-scale self-supervised learning

You do not need human annotation for “what the next word is,” because the text itself naturally contains the label.

This means:

  • massive text data
  • self-supervised training

can be combined naturally.
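
To see how plain text supplies its own labels, the sketch below slices one sentence into (context, next token) training pairs. `next_token_pairs` is a hypothetical helper name, and whitespace splitting stands in for a real tokenizer.

```python
def next_token_pairs(text):
    """Turn raw text into (context, target) pairs with no human labeling."""
    toks = text.split()  # whitespace split as a stand-in for a real tokenizer
    return [(toks[:i], toks[i]) for i in range(1, len(toks))]

pairs = next_token_pairs("I love AI")
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields one training example, so a sentence of n tokens gives n - 1 examples for free.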

This is also why the later path leads to GPT

Because autoregressive language modeling is:

  • simple
  • unified
  • scalable

This path later became one of the main lines of development for large language models.
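
"Autoregressive" means each predicted token is appended to the text and becomes part of the context for the next prediction. The loop below is a minimal sketch of that idea using the bigram counts from the toy corpus rather than a neural model; `generate` and `max_new_tokens` are hypothetical names, and ties are broken by whichever continuation was seen first.

```python
from collections import Counter, defaultdict

corpus = ["I love AI", "I love Python", "You love NLP"]
stats = defaultdict(Counter)
for sent in corpus:
    toks = sent.split()
    for a, b in zip(toks[:-1], toks[1:]):
        stats[a][b] += 1

def generate(start, max_new_tokens=5):
    """Repeatedly predict the next token and feed it back in as context."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = stats.get(out[-1])
        if not followers:          # no known continuation: stop
            break
        # Greedy choice; on a tie, max() keeps the first-seen continuation.
        out.append(max(followers, key=followers.get))
    return out

result = generate("You")
print(result)
```

The loop itself stays the same whether the predictor is a frequency table or a large neural network; only the quality of each single-step prediction changes.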


Misconception 1: a language model is only “good at continuing the next word”

This statement is superficially true, but it underestimates how much this task pushes the model to learn.

Misconception 2: n-gram is useless, so there is no need to learn it

The n-gram model is still very useful for learning, because it lets you see directly what a language model is actually doing.

Misconception 3: if it can generate, then it understands language

Strong generation ability does not mean full understanding. That is also why later we still need to look at reasoning, alignment, and tool use.


Keep this page’s proof of learning as a small evidence card:

  • Representation: BoW, TF-IDF, static embedding, contextual embedding, or language-model score
  • Comparison: nearest text, similarity score, or next-token/log-prob style output
  • Interpretation: what the representation captures and what it misses
  • Failure Check: polysemy, domain mismatch, short text, tokenization, or semantic drift
  • Expected Output: small comparison table with at least one surprising result

The most important takeaway from this lesson is one firm conclusion:

The most basic task of a language model is to predict the next token given the previous context, and this seemingly simple objective is exactly what forms the foundation for many capabilities of later large models.

Once this main thread is clear, you will naturally find it much easier to understand GPT, pretraining, and generative models later.


  1. Add a few more sentences to the corpus and see how stats changes.
  2. Why can we say bigram is simple, yet it already captures the core of a language model?
  3. Explain in your own words: why is a language model naturally suited to large-scale self-supervised training?
  4. Think about it: why can the ability to “continue the next word” eventually grow into conversation and writing abilities?
Reference implementation and walkthrough
  1. Adding corpus sentences changes the transition counts in stats; continuations that appear more often become more likely, and rare continuations shrink in relative frequency.
  2. Bigram is simple, but it already contains the core language-modeling idea: estimate what token is likely next from previous context.
  3. Language modeling is naturally self-supervised because ordinary text already provides input context and the next-token target.
  4. Next-word prediction grows into writing and conversation when scale, representation learning, instruction tuning, feedback, and long context are added.