6.7.3 Training Monitoring and Diagnosis

Section Overview

Training diagnosis means separating symptoms from root causes. Do not switch models first. Read curves, inspect predictions and gradients, check data, then choose one targeted fix.

Learning Objectives

Classify underfitting, overfitting, and unstable training from curves.
Inspect prediction distribution and gradient norm.
Use a repeatable troubleshooting order.
Decide one next experiment from evidence.
Know what to save in every training run.

Look at the Curves First

Figure: Training curve diagnosis chart

The first question is not “which model should I switch to?” It is:

what symptom is visible in the training evidence?

Symptom	Likely direction	First check
train and val both bad	underfitting	learning rate, model capacity, data quality
train improves but val worsens	overfitting	regularization, data split, augmentation
loss jumps up and down	instability	learning rate, batch size, gradients
predictions mostly one class	collapse or data issue	labels, class balance, output layer
metrics suddenly change	pipeline bug or distribution shift	data loader, preprocessing, validation split

Figure: Training diagnosis dashboard troubleshooting map

Lab 1: Classify Curve Patterns

histories = {
    "underfit_case": ([1.20, 1.08, 0.99, 0.94], [1.25, 1.13, 1.04, 1.02]),
    "overfit_case": ([0.90, 0.55, 0.31, 0.18], [0.92, 0.63, 0.68, 0.82]),
    "unstable_case": ([0.80, 1.65, 0.72, 1.48], [0.85, 1.70, 0.79, 1.55]),
}


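def diagnose(train, val):
    """Map per-epoch train/val losses to a coarse symptom label."""
    # Thresholds are absolute and hand-picked for the toy histories above.
    train_drop = train[0] - train[-1]  # total improvement on the training set
    val_best = min(val)                # best validation loss seen so far

    # A wide spread in training loss is treated as instability.
    if max(train) - min(train) > 0.8:
        return "possible_lr_too_high_or_unstable_batches"
    # Both curves end high: the model has not fit the training data.
    if train[-1] > 0.8 and val[-1] > 0.8:
        return "possible_underfitting"
    # Training improved while validation moved away from its best value.
    if train_drop > 0.3 and val[-1] > val_best + 0.1:
        return "possible_overfitting"
    return "need_more_signals"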


print("curve_diagnosis")
for name, (train, val) in histories.items():
    print(name, "->", diagnose(train, val))

Expected output:

curve_diagnosis
underfit_case -> possible_underfitting
overfit_case -> possible_overfitting
unstable_case -> possible_lr_too_high_or_unstable_batches

This code is not a replacement for judgment. It teaches the first habit: classify the visible symptom before changing the system.
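One limitation of the range check in diagnose is that a healthy run falling steadily from 2.0 to 0.5 also has a spread above 0.8 and would be labeled unstable. A possible alternative, sketched here, counts how often the loss rises sharply from one epoch to the next. The name count_upward_jumps and the 20% tolerance are assumptions for illustration and are not part of the lab; the sketch also assumes positive loss values.

```python
def count_upward_jumps(losses, rel_tol=0.2):
    """Count epochs where the loss rises by more than rel_tol relative to the previous epoch."""
    jumps = 0
    for prev, cur in zip(losses, losses[1:]):
        # Assumes positive losses, so a relative comparison is meaningful.
        if cur > prev * (1 + rel_tol):
            jumps += 1
    return jumps


steady_drop = [2.00, 1.40, 0.90, 0.50]  # large range, but never rises
oscillating = [0.80, 1.65, 0.72, 1.48]  # train losses from unstable_case

print("steady_drop jumps:", count_upward_jumps(steady_drop))
print("oscillating jumps:", count_upward_jumps(oscillating))
```

The steady run reports 0 jumps and the oscillating run reports 2, so the two cases separate without an absolute threshold on the loss range.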

Lab 2: Check Gradients and Prediction Distribution

Loss alone is not enough. A model can have a reasonable loss while predicting the same class for every sample.

import torch
from torch import nn

torch.manual_seed(5)

X = torch.randn(12, 3)
y = torch.tensor([0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))
loss_fn = nn.CrossEntropyLoss()

logits = model(X)
loss = loss_fn(logits, y)
loss.backward()

grad_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        grad_norm += p.grad.pow(2).sum().item()
grad_norm = grad_norm**0.5

preds = logits.argmax(dim=1)
counts = torch.bincount(preds, minlength=2)
confidence = torch.softmax(logits, dim=1).max(dim=1).values.mean().item()

print("training_signals")
print("loss:", round(loss.item(), 3))
print("grad_norm:", round(grad_norm, 3))
print("pred_counts:", counts.tolist())
print("avg_confidence:", round(confidence, 3))

Expected output (exact values can differ across PyTorch versions and platforms):

training_signals
loss: 0.687
grad_norm: 0.445
pred_counts: [0, 12]
avg_confidence: 0.69

Figure: Training diagnosis signal result map

The important signal is pred_counts: [0, 12]. This untrained model predicts class 1 for every sample, which is unremarkable at initialization. If the pattern persists during real training, check class imbalance, labels, output layer shape, and loss setup.
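The pred_counts check can be reduced to a single number that is easy to log every epoch: the share of predictions going to the most frequent class. The sketch below is a minimal version under stated assumptions: dominant_class_fraction is a hypothetical helper name, and the 0.9 threshold is illustrative and should be chosen relative to the true class balance.

```python
from collections import Counter


def dominant_class_fraction(preds):
    """Return (most common predicted class, fraction of predictions assigned to it)."""
    if not preds:
        raise ValueError("preds must not be empty")
    label, count = Counter(preds).most_common(1)[0]
    return label, count / len(preds)


collapsed = [1] * 12                             # same pattern as pred_counts [0, 12]
balanced = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]  # the labels y from Lab 2

for name, preds in [("collapsed", collapsed), ("balanced", balanced)]:
    label, frac = dominant_class_fraction(preds)
    # 0.9 is an illustrative threshold, not a recommended default.
    status = "possible_collapse" if frac > 0.9 else "ok"
    print(name, label, round(frac, 2), status)
```

With the tensors from Lab 2, the input would be preds.tolist().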

A Troubleshooting Order

Use this order before changing the architecture:

1. Curves: train/val loss and metrics.
2. Predictions: class counts, confidence, best and worst examples.
3. Gradients: norm, NaN/Inf, exploding or near-zero updates.
4. Data: labels, leakage, split, preprocessing, augmentation.
5. Hyperparameters: learning rate, batch size, regularization.
6. Model: capacity, architecture, initialization.

This order is deliberately boring. That is why it works.
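For the gradient step, a single global norm can hide a layer whose gradients have vanished. The following sketch reports the norm per parameter, together with flags for non-finite and near-zero gradients. The helper name gradient_report and the 1e-7 cutoff are assumptions chosen for illustration; the model mirrors the one in Lab 2, but the random data is made up.

```python
import torch
from torch import nn


def gradient_report(model, tiny=1e-7):
    """Summarize gradients per parameter; call after loss.backward()."""
    report = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue  # parameter did not take part in the backward pass
        norm = p.grad.norm().item()
        report[name] = {
            "norm": norm,
            "finite": bool(torch.isfinite(p.grad).all()),
            "near_zero": norm < tiny,  # illustrative cutoff
        }
    return report


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))
X = torch.randn(8, 3)
y = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(X), y)
loss.backward()

for name, stats in gradient_report(model).items():
    print(name, stats)
```

Logging this report every few hundred steps is usually enough to see whether a problem sits in one layer or affects the whole network.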

What to Save During Training

Artifact	Why save it
train/val curves	diagnose trend and overfitting
config and seed	reproduce the run
best checkpoint	compare without retraining
prediction samples	inspect failures directly
gradient statistics	catch instability early
data split version	detect leakage or drift
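Several of these artifacts fit into one small JSON file written at the end of a run. The sketch below is one possible layout, not a standard: the function name save_run_record, the file name run_record.json, the config values, and the split tag are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_run_record(run_dir, config, seed, val_losses, split_version):
    """Write config, seed, validation curve, best epoch, and split tag to JSON."""
    # Epochs are 0-indexed; the best epoch has the lowest validation loss.
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    record = {
        "config": config,
        "seed": seed,
        "split_version": split_version,
        "val_losses": val_losses,
        "best_epoch": best_epoch,
        "best_val_loss": val_losses[best_epoch],
    }
    path = Path(run_dir) / "run_record.json"
    path.write_text(json.dumps(record, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = save_run_record(
        tmp,
        config={"lr": 0.01, "batch_size": 32},  # illustrative values
        seed=5,
        val_losses=[0.92, 0.63, 0.68, 0.82],    # overfit_case validation curve
        split_version="split-v1",               # hypothetical tag
    )
    saved = json.loads(path.read_text())

print("best_epoch:", saved["best_epoch"], "best_val_loss:", saved["best_val_loss"])
```

For the overfit_case curve this records best_epoch 1 with a loss of 0.63, which is the checkpoint worth keeping.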

Diagnosis to Action

Diagnosis	First action
possible underfitting	raise LR within reason, train longer, increase capacity, inspect labels
possible overfitting	early stopping, stronger regularization, more data, augmentation
unstable training	lower LR, increase batch size, add gradient clipping
prediction collapse	check class balance, target encoding, output shape, loss function
data pipeline issue	print sample batches, verify preprocessing and split
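As an illustration of the first action for possible overfitting, early stopping can be expressed as a patience counter over the validation loss. This is a minimal sketch under assumptions: early_stop_epoch, patience, and min_delta are names chosen here, and real frameworks may define patience slightly differently.

```python
def early_stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), 0-indexed; stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter.
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch


val = [0.92, 0.63, 0.68, 0.82]  # overfit_case validation losses from Lab 1
print(early_stop_epoch(val, patience=2))
```

On the overfit_case curve this prints (3, 1): training would stop at epoch 3, and the checkpoint from epoch 1 is the one to restore.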

Common Mistakes

Mistake	Fix
only reading final accuracy	save full curves and best epoch
changing model before checking data	inspect sample batches and labels first
ignoring prediction distribution	print class counts or output summaries
assuming low train loss means success	compare validation and failure cases
making multiple fixes at once	choose one action and verify the result

Exercises

Add a good_case history where train and val both improve.
Modify Lab 2 so the model has 3 classes. What changes in torch.bincount?
Add a check that reports has_nan_grad.
Write one next action for each diagnosis in Lab 1.
Save a CSV-style log with epoch,train_loss,val_loss,val_acc.

Key Takeaways

Symptoms are not root causes.
Curves are the first diagnostic surface.
Predictions and gradients reveal failures that loss can hide.
Data checks come before architecture changes.
A good diagnosis ends with one targeted next experiment.
