Skip to content

5.4.4 Bias-Variance Tradeoff

Three-panel bias-variance tradeoff diagram

This lesson uses decision trees to show:

  • how model complexity changes train and test scores;
  • how to identify underfitting and overfitting from the train-test gap;
  • how learning curves show whether more data may help;
  • what action to take for high bias vs high variance.

Bias-variance action diagnosis map

Terminal window
python -m pip install -U scikit-learn numpy

Create bias_variance_lab.py:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import learning_curve, train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
print("complexity_lab")
for depth in [1, 3, 5, None]:
model = DecisionTreeClassifier(max_depth=depth, random_state=42)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
gap = train_acc - test_acc
print(
f"max_depth={str(depth):<4} "
f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f} "
f"leaves={model.get_n_leaves()}"
)
print("learning_curve_lab")
model = DecisionTreeClassifier(max_depth=3, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
model,
X,
y,
cv=5,
train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0],
scoring="accuracy",
)
for size, train_mean, val_mean in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
print(f"train_size={size:<3} train={train_mean:.3f} cv={val_mean:.3f} gap={train_mean - val_mean:.3f}")

Run it:

Terminal window
python bias_variance_lab.py

Expected output:

Terminal window
complexity_lab
max_depth=1 train=0.923 test=0.923 gap=-0.001 leaves=2
max_depth=3 train=0.977 test=0.944 gap=0.032 leaves=7
max_depth=5 train=0.995 test=0.937 gap=0.058 leaves=15
max_depth=None train=1.000 test=0.923 gap=0.077 leaves=18
learning_curve_lab
train_size=91 train=0.989 cv=0.847 gap=0.142
train_size=182 train=0.986 cv=0.870 gap=0.116
train_size=273 train=0.978 cv=0.903 gap=0.075
train_size=364 train=0.975 cv=0.917 gap=0.057
train_size=455 train=0.974 cv=0.919 gap=0.055

Bias-variance lab result map

The tree becomes more complex as max_depth grows:

max_depth=1 train=0.923 test=0.923 gap=-0.001 leaves=2
max_depth=None train=1.000 test=0.923 gap=0.077 leaves=18

max_depth=1 is simple. Train and test are similar, but the score is not the best. This can be high bias: the model may be too simple.

max_depth=None memorizes training data perfectly, but test accuracy drops. This is high variance: the model fits training details that do not generalize.

The best practical model is often in the middle:

max_depth=3 train=0.977 test=0.944 gap=0.032

It is not perfect on training data, but it generalizes better.

Learning curve diagnosis map

The learning curve shows what happens when more training data is used:

train_size=91 train=0.989 cv=0.847 gap=0.142
train_size=455 train=0.974 cv=0.919 gap=0.055

As data grows, validation score improves and the gap shrinks. That suggests more data may help, but the model still has room for better features or tuning.

PatternLikely problemTry
train low, validation lowhigh bias / underfittingricher model, better features, less regularization
train high, validation lowhigh variance / overfittingsimpler model, more regularization, more data
train high, validation highgood fittest on final holdout and monitor drift
validation varies a lot by foldinstability or data segmentsinspect folds, add data, use robust models

Do not diagnose from one metric alone. Check train score, validation score, gap, and whether the mistakes are concentrated in a segment.

For high bias:

  • add useful features;
  • use a more expressive model;
  • reduce overly strong regularization;
  • train longer if the model is iterative.

For high variance:

  • reduce model complexity;
  • increase regularization;
  • collect more diverse data;
  • use cross-validation and a final holdout;
  • consider ensembles that reduce variance.
SymptomLikely causeFix
Both train and validation are poormodel cannot express the patternimprove features or model class
Train is perfect, validation is worseoverfittinglimit depth, prune, regularize
More data improves validationvariance or data scarcitycollect more representative data
More data does not helphigh bias or noisy labelsimprove features, labels, or model
Validation score jumps by folddata is uneveninspect segment distribution
  1. Add min_samples_leaf=5 to the tree. How does the gap change?
  2. Try max_depth=2, 4, 6, 8. Where does test accuracy peak?
  3. Replace the tree with logistic regression. Is the issue bias or variance?
  4. Use 5-fold cross-validation instead of one test split for the complexity lab.
  5. Inspect mistakes for the best tree. Are errors concentrated in one class?
Reference implementation and walkthrough
  1. min_samples_leaf=5 usually narrows the train/test gap because the tree cannot memorize tiny leaves as easily. If both scores fall, the model may now be too simple.
  2. Test accuracy often peaks at a middle depth. Very shallow trees underfit; very deep trees overfit even when training accuracy keeps rising.
  3. Logistic regression is a useful bias check. If it performs similarly and both models score low, features or model class may be too weak; if the tree trains much better but tests worse, variance is the main issue.
  4. Five-fold CV gives a more stable complexity comparison than one lucky split. Choose the depth with the best mean score and acceptable variance.
  5. Concentrated errors suggest a class-specific feature gap, label ambiguity, or imbalance problem. Randomly scattered errors suggest the remaining noise may be harder to remove.

Keep this page’s proof of learning as a small evidence card:

Evaluation Setup
split, cross-validation, metric, baseline, and comparison target
Result
score table, curve, confusion matrix, validation result, or search outcome
Decision
whether to change data, features, model, threshold, or hyperparameters
Failure Check
leakage, unstable validation, wrong metric, or tuning on the test set
Expected Output
evaluation record that supports a next modeling decision

You are done when you can explain:

  • high bias means the model is too simple or missing signal;
  • high variance means the model is too sensitive to training details;
  • the train-validation gap is a practical diagnostic;
  • learning curves show whether more data may help;
  • the fix depends on the pattern, not the vocabulary.