9.1.6 From TD-Gammon to AlphaGo: How Reinforcement Learning Shaped Agents

How historical breakthroughs in reinforcement learning map to modern Agents

Section overview

Modern LLM Agents are not the same as reinforcement learning, but the concept of an Agent is deeply connected to the history of reinforcement learning.

In this section, we focus on three stories:

  • TD-Gammon showed that machines can get stronger through self-play.
  • DQN showed that deep networks can learn policies from pixels and rewards.
  • AlphaGo showed that combining learning, search, and planning can break through complex games.

Why does an Agent course need reinforcement learning history?

An Agent cares about:

  • observing the state in an environment
  • deciding the next action
  • adjusting strategy based on feedback
  • planning for long-term goals

This is highly similar to the basic problems in reinforcement learning.
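That shared loop can be written down in a few lines. Below is a minimal sketch of the generic observe–decide–act–learn cycle; `Environment` and `Agent` are hypothetical placeholders, not a real library.

```python
# A minimal sketch of the agent loop shared by RL and LLM Agents.
# `Environment` and `Agent` are hypothetical placeholders, not a real library.

class Environment:
    def reset(self):
        return "initial state"

    def step(self, action):
        # Return (next state, feedback signal, whether the task is finished).
        return "next state", 1.0, True

class Agent:
    def act(self, state):
        return f"action chosen for: {state}"      # decide the next action

    def learn(self, state, action, reward):
        pass                                      # adjust the strategy from feedback

env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)                     # observe the state, pick an action
    next_state, reward, done = env.step(action)   # the environment responds
    agent.learn(state, action, reward)            # fold feedback into the strategy
    state = next_state
```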

| Reinforcement learning term | Agent system term |
| --- | --- |
| state | current context, task state |
| action | tool call, response, planning step |
| reward | user feedback, evaluation score, whether the task is complete |
| policy | decision strategy, rules for tool use |
| environment | external systems, knowledge base, browser, code repository |

So the history of reinforcement learning is not a side topic. It helps you understand why Agents need to care about feedback, planning, trial and error, and safety boundaries.

TD-Gammon: learning strategy from self-play

Around 1992, Gerald Tesauro’s TD-Gammon achieved a very strong level of play in backgammon using temporal-difference learning.

What made it especially compelling was this:

The system did not merely imitate human game records, but improved its judgment through massive self-play and feedback from outcomes.

For beginners, you can think about it like this:

| Ordinary supervised learning | The TD-Gammon style |
| --- | --- |
| Every step has a standard answer | Often only the final win/lose result is available |
| The focus is on fitting labels | The focus is on learning a long-term strategy |
| Data is usually provided by humans | The system can generate experience through self-play |

This opened up an important idea for later reinforcement learning and game AI:

If a system can generate its own experience, it is not fully limited by manually labeled data.
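To make "learning from outcome feedback" concrete, here is a tabular TD(0) value update, a much simpler relative of the TD(λ)-plus-neural-network setup TD-Gammon actually used. The states, episode, and final reward below are toy placeholders.

```python
from collections import defaultdict

# Tabular TD(0): nudge each state's value toward the bootstrapped target
# built from the next state's value and the (mostly end-of-game) reward.
values = defaultdict(float)   # V(s), initialized to 0
alpha, gamma = 0.1, 1.0       # learning rate, discount factor

episode = ["s0", "s1", "s2"]  # states visited during one self-play game
final_reward = 1.0            # e.g. +1 for a win, 0 for a loss

for t in range(len(episode) - 1):
    s, s_next = episode[t], episode[t + 1]
    r = final_reward if t + 1 == len(episode) - 1 else 0.0  # reward only at the end
    # TD error: how far the current estimate is from the bootstrapped target.
    td_error = r + gamma * values[s_next] - values[s]
    values[s] += alpha * td_error
```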

DQN Atari: from pixels to actions

In 2015, DeepMind’s DQN achieved a breakthrough on Atari games. Its significance was that it combined deep learning and reinforcement learning:

  • the input was game-screen pixels
  • the output was the next action
  • feedback came from the game score

It is like teaching a model to learn games starting from “looking at the screen.”

Game screen -> neural network -> action -> score feedback -> policy update
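The core of that loop can be sketched with tabular Q-learning. Real DQN replaces the table with a deep network over raw pixels and adds a replay buffer and target network; this toy version, with placeholder states and actions, keeps only the essential update.

```python
import random
from collections import defaultdict

# Tabular Q-learning: a drastically simplified stand-in for DQN's deep network.
Q = defaultdict(float)                 # Q[(state, action)] value estimates
actions = ["left", "right", "fire"]
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Target = immediate score feedback + discounted estimate of the future.
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])  # policy update

s = "frame_0"
a = choose_action(s)
update(s, a, reward=1.0, next_state="frame_1")
```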

What this suggests for modern Agents:

  • an Agent does not have to work only on static text
  • an Agent can take continuous actions in an environment
  • actions change the future state
  • evaluation does not always appear immediately after each step

This is also why evaluating Agents is more difficult than evaluating ordinary question-answering systems.

AlphaGo: combining learning, search, and planning

In 2016, AlphaGo defeated Lee Sedol, giving many people their first direct, visceral experience of an AI breakthrough.

The key to AlphaGo was not “one neural network simply playing Go,” but a combination of multiple abilities:

| Capability | Role in AlphaGo | Inspiration for Agents |
| --- | --- | --- |
| policy network | judges candidate next moves | generates possible actions |
| value network | estimates how good the position is | evaluates the current plan |
| Monte Carlo tree search | simulates moves ahead to estimate outcomes | planning and search |
| self-play | generates more training experience | improves from feedback |

For Agents, this is extremely important:

Strong systems are often not the result of one model working alone, but of models, search, tools, feedback, and constraints working together.
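As a very rough illustration of "policy proposes, value evaluates, search decides", here is a one-ply lookahead sketch. Real AlphaGo runs a full Monte Carlo tree search; `policy_net`, `value_net`, and `apply_move` below are hypothetical placeholders.

```python
# A drastically simplified sketch: the policy proposes candidate moves,
# the value function evaluates the resulting positions, and a (one-ply)
# search picks the best. All three functions are toy placeholders.

def policy_net(position):
    # Would return candidate moves with prior probabilities.
    return [("move_a", 0.5), ("move_b", 0.3), ("move_c", 0.2)]

def value_net(position):
    # Would return an estimate in [-1, 1] of how good the position is.
    return 0.0

def apply_move(position, move):
    return position + [move]

def select_move(position, top_k=3):
    candidates = policy_net(position)[:top_k]          # generate possible actions
    scored = [(move, value_net(apply_move(position, move)))
              for move, _prior in candidates]          # evaluate each plan
    return max(scored, key=lambda mv: mv[1])[0]        # planning and search

print(select_move([]))  # -> "move_a" with these placeholder networks
```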

What does this lineage have to do with LLM Agents?

The core of modern LLM Agents is not necessarily a reinforcement learning algorithm, but they inherit many of reinforcement learning’s problems:

| Classical RL problem | LLM Agent version |
| --- | --- |
| How should reward be defined? | How should task success, correct citations, and user satisfaction be measured? |
| Is exploration dangerous? | Could a tool call accidentally delete files or send the wrong request? |
| How should long-term goals be broken down? | How should multi-step tasks be planned, executed, and corrected? |
| How should the policy be evaluated? | Agent benchmarks, log replay, human review |

So when you later study ReAct, Plan-and-Execute, tool calling, and Agent evaluation, you can think of them as:

new implementations in the language-model era of the old problems of “action, feedback, and planning.”
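For example, a ReAct-style loop can be sketched in a few lines. The `llm` and `tools` below are hypothetical placeholders, not a real API, but the skeleton is the old cycle again: decide an action, observe feedback, fold it back into the plan.

```python
# A hypothetical ReAct-style loop: "action, feedback, planning"
# re-implemented around a language model. `llm` and `tools` are placeholders.

def llm(prompt):
    # Would call a language model; here it always decides to finish.
    return {"thought": "I have enough information.", "action": "finish",
            "input": "final answer"}

tools = {"search": lambda q: f"results for {q}"}

def run_agent(task, max_steps=5):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))                      # reason about the next action
        if step["action"] == "finish":
            return step["input"]
        observation = tools[step["action"]](step["input"])  # act in the environment
        history.append(f"Observation: {observation}")       # feed the result back
    return "gave up"
```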

Mapping historical milestones to course chapters

| Historical milestone | Problem it solved | Corresponding course chapter |
| --- | --- | --- |
| TD-Gammon | Learning strategy from self-play and long-term feedback | 9.1 Agent historical background, 9.2 reasoning and planning |
| DQN / Atari | Deep networks learning actions from environmental feedback | 9.8 Agent evaluation, safety, and environment interaction |
| AlphaGo | Combining learning, search, and planning into a strong system | 9.2 planning, 9.7 multi-Agent / complex systems |
| RLHF | Adjusting model behavior using human preferences | Chapter 7 alignment, 9.8 safety evaluation |
| ReAct | Letting the model alternate between reasoning and acting | 9.2 ReAct, 9.3 tool calling |

The intuition you should have after this section

An Agent is not just “letting the model improvise.” It is more like a system that constantly balances the following:

  • goals
  • actions
  • environment
  • feedback
  • planning
  • safety constraints

The stories of TD-Gammon, DQN, and AlphaGo tell us: truly strong intelligent systems are usually not just good at answering questions—they can act in an environment and adjust their strategy based on feedback.