
7.2.2 Development History of Large Models

[Figure: 15-stage AI development map]

Read This Page as a Map, Not a Memory Test

You do not need to memorize dates. Keep one line in mind:

rules -> statistics -> neural representations -> attention -> scale -> alignment -> tools

Large language models are the result of this long shift, not a sudden invention.

The 15-Stage Big Picture

| Stage | What changed | Why it matters to LLMs |
| --- | --- | --- |
| 1. Turing question | machine intelligence became a concrete question | language became a key test of intelligence |
| 2. Dartmouth AI | AI became a research field | symbolic reasoning dominated early thinking |
| 3. Perceptron | neural learning appeared | first wave of trainable models |
| 4. Expert systems | rules scaled inside narrow domains | showed both value and maintenance pain |
| 5. Backpropagation | multilayer neural nets became trainable | foundation for deep learning |
| 6. LeNet | neural nets worked on real perception tasks | showed representation learning in practice |
| 7. Statistical ML | data-driven methods beat many hand-written rules | NLP moved toward corpus evidence |
| 8. ImageNet / AlexNet | deep learning won at scale | data + compute + architecture mattered |
| 9. ResNet | very deep networks became trainable | scale became more reliable |
| 10. RNN / LSTM | sequence modeling became neural | language modeling moved beyond n-grams |
| 11. Attention | models could focus on relevant positions | solved part of the long-context bottleneck |
| 12. Transformer | attention became the main architecture | parallel training and scaling took off |
| 13. BERT / GPT | pretraining became the shared foundation | one model could transfer to many tasks |
| 14. RLHF / ChatGPT | behavior was aligned with instructions | model capability became product behavior |
| 15. RAG / Agent | models used knowledge and tools | LLMs became application systems |

Now zoom in on the language-model line.

Five Language-Model Eras

| Era | Core idea | Main limitation |
| --- | --- | --- |
| Rule-based systems | humans write language rules | brittle and expensive to maintain |
| Statistical language models | next word follows observed frequency | sparse data and short context |
| Neural sequence models | learn vectors and recurrent state | hard to train long dependencies |
| Transformers | every token can attend to relevant tokens | compute and data cost are high |
| LLM + alignment | scale pretraining, then tune behavior | hallucination, safety, cost, evaluation |

The through line is context. Each era tried to use more context with fewer brittle assumptions.

Lab: Build a Bigram Language Model

This small n-gram model predicts the next word from the current word. It is not powerful, but it shows the statistical idea that came before neural LMs.

```python
from collections import Counter, defaultdict

corpus = [
    "I like learning AI",
    "I like learning Python",
    "You like learning NLP",
    "I like doing projects",
]

# For each word, count which words follow it in the corpus.
next_word_counter = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for current_word, next_word in zip(tokens[:-1], tokens[1:]):
        next_word_counter[current_word][next_word] += 1


def suggest_next(word):
    # .get avoids inserting an empty Counter for unseen words.
    candidates = next_word_counter.get(word, Counter())
    return candidates.most_common()


print("Common words after I       :", suggest_next("I"))
print("Common words after like    :", suggest_next("like"))
print("Common words after learning:", suggest_next("learning"))
```

Expected output:

```
Common words after I       : [('like', 3)]
Common words after like    : [('learning', 3), ('doing', 1)]
Common words after learning: [('AI', 1), ('Python', 1), ('NLP', 1)]
```

[Figure: Bigram language model result map]

This already feels like autocomplete. But it has three obvious limits (the sketch after this list makes the first two concrete):

  • it only looks one word back;
  • rare combinations have weak statistics;
  • it has no semantic representation of the sentence.
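
To feel the first two limits directly, extend the same counting idea from bigrams to trigrams. This is a minimal sketch that reuses the corpus from the lab above:

```python
from collections import Counter, defaultdict

corpus = [
    "I like learning AI",
    "I like learning Python",
    "You like learning NLP",
    "I like doing projects",
]

# Condition on the previous TWO words instead of one.
trigram_counter = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counter[(w1, w2)][w3] += 1

print(trigram_counter[("like", "learning")].most_common())
# -> [('AI', 1), ('Python', 1), ('NLP', 1)]
```

More context helps, but the table gets sparser: most two-word prefixes never appear in a small corpus at all, so the counts become unreliable exactly when the context grows. That trade-off is the data-sparsity limit above.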

Why Neural Models Mattered

Neural language models replaced raw counting with learned representations:

word id -> vector -> context state -> prediction

Word2Vec and GloVe gave words vector representations; RNNs, LSTMs, and GRUs carried context through a recurrent state. Together they helped models learn similarity and longer context, but sequential reading still made training slow and long-range memory fragile.
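
The shape of that pipeline is easy to show in code. Below is a toy sketch with random, untrained weights; the names (`E`, `W_h`, `W_out`) and the state size are illustrative choices, and the predictions are meaningless until the weights are trained. The point is the dataflow, not the output:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["I", "You", "like", "learning", "doing", "AI", "Python", "NLP", "projects"]
word_to_id = {w: i for i, w in enumerate(vocab)}

d = 8                                     # embedding / state size (toy choice)
E = rng.normal(size=(len(vocab), d))      # word id -> vector
W_h = rng.normal(size=(d, d))             # previous state -> new state
W_out = rng.normal(size=(len(vocab), d))  # state -> next-word scores

def next_word_probs(sentence):
    state = np.zeros(d)
    for word in sentence.split():
        vec = E[word_to_id[word]]           # word id -> vector
        state = np.tanh(W_h @ state + vec)  # vector -> context state
    logits = W_out @ state                  # context state -> prediction
    probs = np.exp(logits - logits.max())   # softmax over the vocabulary
    return probs / probs.sum()

probs = next_word_probs("I like learning")
print(sorted(zip(vocab, probs), key=lambda p: -p[1])[:3])
```

Unlike the bigram table, nothing here grows with context length: everything the model has read so far is compressed into one fixed-size state vector, which is both the power and the fragility of recurrent models.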

Why Transformer Was the Turning Point

An RNN reads a sequence one step at a time. A Transformer lets every token directly compare itself with other tokens through attention:

current token -> attends to relevant tokens -> updated representation

That changed three things:

  • training could be more parallel;
  • long-range relationships became easier to model;
  • scaling parameters, data, and compute became more effective.
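
Here is a minimal numpy sketch of scaled dot-product attention, the operation behind all three points. In a real Transformer the queries, keys, and values come from learned projections of the input; here they are set equal to the raw token vectors to keep the sketch short:

```python
import numpy as np

def attention(Q, K, V):
    # Each query row compares itself with every key row,
    # then mixes the value rows by those softmax weights.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))      # 4 tokens, dimension 8 (toy sizes)
out, w = attention(X, X, X)      # self-attention: Q = K = V = X

print(w.round(2))  # each row sums to 1: how much each token attends to the others
```

Every row of `scores` is computed independently of the others, which is why attention parallelizes across tokens while an RNN must wait for its previous state.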

This is why BERT, GPT, T5, and later LLMs share the Transformer family tree.

Why Scale Was Not Enough

Large-scale pretraining made models broadly capable, but product behavior still needed another layer:

| Need | Technique |
| --- | --- |
| follow instructions | instruction tuning |
| prefer helpful responses | preference learning / RLHF |
| use current or private knowledge | RAG |
| perform actions | tool calling / Agent loop |
| reduce unsafe behavior | safety evaluation and guardrails |
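
The RAG row deserves one concrete picture. Below is a deliberately tiny sketch of the retrieval half: the documents and the question are made up, and similarity is bag-of-words cosine rather than the dense embeddings a real system would use:

```python
from collections import Counter
import math

docs = {
    "policy": "refunds require a receipt within 30 days",
    "setup":  "install the client and sign in with your team account",
    "limits": "the API allows 100 requests per minute per key",
}

def bow(text):
    # Bag-of-words: ignore order, just count lowercase tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(question, k=1):
    # Rank documents by similarity to the question, return the top k.
    q = bow(question)
    ranked = sorted(docs, key=lambda name: cosine(q, bow(docs[name])), reverse=True)
    return ranked[:k]

question = "how many API requests can I send per minute"
best = retrieve(question)[0]
print(best, "->", docs[best])
```

A real pipeline would then paste the retrieved passage into the model's prompt as evidence, so the answer can draw on current or private knowledge the model was never trained on.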

This is the key modern distinction:

model capability != model behavior

A model can be powerful and still fail to follow policy, cite evidence, or act safely.

What to Remember

Large models grew out of NLP history, but they no longer fit inside a narrow NLP boundary. The same architecture and training ideas are now used for text, image, speech, code, video, multimodal QA, RAG, and agents.

The practical lesson is:

  • rules gave control but poor coverage;
  • statistics gave data evidence but short context;
  • neural representations gave semantic space;
  • Transformer made scale practical;
  • alignment, RAG, and tools turned models into systems.

Exercises

  1. Add two sentences to the bigram corpus and observe how suggestions change.
  2. Why does a bigram model fail on long instructions?
  3. Explain why Transformer training is easier to parallelize than RNN training.
  4. Give one example where a model has capability but still needs alignment or RAG.
  5. Pick one of the 15 stages and explain how it still affects today’s LLM applications.