6.1.5 Optimizers

Optimizer path comparison

What You Will Build

This lesson runs a tiny PyTorch optimization lab:

compare SGD, Momentum, and Adam on the same simple loss;
see overshooting directly;
test learning-rate sensitivity;
learn a safe optimizer choice order.

Optimizer decision map from gradients to parameter updates

Setup

python -m pip install -U torch

Run the Complete Lab

Create optimizer_lab.py:

import torch


def run_optimizer(name, optimizer_factory, steps=25):
    torch.manual_seed(42)
    w = torch.nn.Parameter(torch.tensor([5.0]))
    optimizer = optimizer_factory([w])
    for step in range(1, steps + 1):
        loss = (w - 2).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step in [1, 5, 10, 25]:
            print(f"{name:<8} step={step:<2} w={w.item():.3f} loss={loss.item():.4f}")


print("optimizer_comparison")
run_optimizer("sgd", lambda params: torch.optim.SGD(params, lr=0.1))
run_optimizer("momentum", lambda params: torch.optim.SGD(params, lr=0.1, momentum=0.9))
run_optimizer("adam", lambda params: torch.optim.Adam(params, lr=0.1))

print("learning_rate_check")
for lr in [0.01, 0.1, 1.1]:
    torch.manual_seed(42)
    w = torch.nn.Parameter(torch.tensor([5.0]))
    optimizer = torch.optim.SGD([w], lr=lr)
    for _ in range(10):
        loss = (w - 2).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    final_loss = (w - 2).pow(2).item()
    print(f"lr={lr:<4} final_w={w.item():.3f} final_loss={final_loss:.4f}")

Run it:

python optimizer_lab.py

Expected output:

optimizer_comparison
sgd      step=1  w=4.400 loss=9.0000
sgd      step=5  w=2.983 loss=1.5099
sgd      step=10 w=2.322 loss=0.1621
sgd      step=25 w=2.011 loss=0.0002
momentum step=1  w=4.400 loss=9.0000
momentum step=5  w=0.259 loss=0.8571
momentum step=10 w=2.013 loss=0.6767
momentum step=25 w=2.475 loss=0.0200
adam     step=1  w=4.900 loss=9.0000
adam     step=5  w=4.502 loss=6.7648
adam     step=10 w=4.014 loss=4.4535
adam     step=25 w=2.739 loss=0.6569
learning_rate_check
lr=0.01 final_w=4.451 final_loss=6.0085
lr=0.1  final_w=2.322 final_loss=0.1038
lr=1.1  final_w=20.575 final_loss=345.0386

Optimizer lab result dashboard

Read the Experiment

The loss is:

loss = (w - 2)^2

The best value is w=2. All optimizers start from w=5.

In this simple example, SGD with a good learning rate works extremely well:

sgd step=25 w=2.011 loss=0.0002

Momentum moves faster, but it can overshoot:

momentum step=5 w=0.259

Adam is a very common default in deep learning, but it is not magic. With lr=0.1 on this tiny problem, it moves more slowly than tuned SGD. The lesson is not “Adam is bad”; the lesson is:

Always inspect training behavior. Optimizer choice and learning rate work together.

Learning Rate Is the First Knob

The learning-rate check is intentionally blunt:

lr=0.01 final_w=4.451 final_loss=6.0085
lr=0.1  final_w=2.322 final_loss=0.1038
lr=1.1  final_w=20.575 final_loss=345.0386

Too small: training crawls.

Reasonable: training approaches the optimum.

Too large: training diverges.

Evidence to Keep

Keep one optimizer comparison table in your notes:

Same Loss: (w - 2)^2
Same Start: w = 5
Sgd Result: approaches w = 2 with lr=0.1
Momentum Result: moves faster but overshoots
Bad Lr Result: lr=1.1 diverges

This evidence is more useful than memorizing optimizer names. It shows the real rule: gradients give direction, while optimizer settings decide the size and style of movement.

Optimizer Intuition

Optimizer	Intuition	Good first use
SGD	move directly against the gradient	simple baseline, controlled experiments
SGD + Momentum	keep velocity from previous steps	smoother progress in noisy directions
Adam	adapt step sizes using gradient history	strong default for many neural networks

For real neural networks, Adam or AdamW is often a practical starting point. For final training, always compare with the task’s validation metric.

Practical Selection Order

Start with Adam or AdamW for a neural network baseline.
Tune learning rate before arguing about optimizer names.
Watch training and validation loss curves.
If validation is unstable, lower LR or add scheduling.
If training is slow but stable, try LR schedule or optimizer change.

Practical Debugging Checklist

Symptom	Likely cause	Fix
loss explodes	learning rate too high	lower LR
loss decreases too slowly	LR too low or poor scaling	raise LR carefully, normalize inputs
training loss drops but validation worsens	overfitting	regularize, add data, stop earlier
loss oscillates	momentum/LR too aggressive	lower LR or momentum
Adam works but final quality is weak	optimizer hides other issues	check data, architecture, regularization

Practice

Change SGD learning rate to 0.05, 0.2, and 0.8.
Change momentum from 0.9 to 0.5. Does overshooting reduce?
Try AdamW instead of Adam.
Print w.grad each step to connect gradients with updates.
Plot w over steps for each optimizer.

Reference implementation and walkthrough

A smaller learning rate should move more slowly; a moderate one should converge faster; a very large one may bounce around or diverge.
Lower momentum usually reduces overshooting, but it can also remove useful acceleration. Compare both the path of w and the final loss.
AdamW often behaves like Adam on this toy problem, but its weight decay is decoupled from the adaptive update. That distinction matters more in larger models.
w.grad points in the direction that would increase loss, so the optimizer usually moves parameters against that gradient. The update size depends on optimizer state and learning rate.
The w plot should show whether the optimizer crawls, heads smoothly to the optimum, overshoots, or oscillates. This is easier to trust than a single final number.

Pass Check

You are done when you can explain:

gradients say which direction changes loss;
optimizers decide how far parameters move;
learning rate can make training crawl, converge, or diverge;
momentum can speed movement but can overshoot;
Adam is useful, but not a substitute for checking curves.