5.4.5 Hyperparameter Tuning

What You Will Build

This lesson shows:

the difference between parameters and hyperparameters;
how to use GridSearchCV;
how to use RandomizedSearchCV when the search space grows;
how to keep a final holdout untouched;
how to avoid over-tuning.

Hyperparameter tuning validation workflow diagram

Keyword Decoder

Term	Practical meaning
parameter	learned by the model during `fit()`
hyperparameter	chosen by you before training, such as tree depth
search space	the candidate values you allow the search to try
CV score	cross-validation score used to choose settings
final holdout	untouched data used once after tuning
budget	number of combinations or trials you can afford

Setup

python -m pip install -U scikit-learn

Run the Complete Lab

Create tuning_lab.py:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold, train_test_split


X, y = load_breast_cancer(return_X_y=True)
X_train, X_final, y_train, y_final = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("grid_search_lab")
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={
        "n_estimators": [80, 160],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 3],
    },
    scoring="f1",
    cv=cv,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print("best_params=", grid.best_params_)
print(f"best_cv_f1={grid.best_score_:.3f}")
final_pred = grid.best_estimator_.predict(X_final)
print(
    f"final accuracy={accuracy_score(y_final, final_pred):.3f} "
    f"recall={recall_score(y_final, final_pred):.3f} "
    f"f1={f1_score(y_final, final_pred):.3f}"
)

print("random_search_lab")
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [60, 100, 160, 220],
        "max_depth": [3, 5, 8, None],
        "min_samples_leaf": [1, 2, 3, 5],
        "max_features": ["sqrt", "log2", None],
    },
    n_iter=8,
    scoring="f1",
    cv=cv,
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)
print("best_params=", random_search.best_params_)
print(f"best_cv_f1={random_search.best_score_:.3f}")

print("top_3_grid_results")
rows = sorted(
    zip(grid.cv_results_["mean_test_score"], grid.cv_results_["params"]),
    key=lambda item: item[0],
    reverse=True,
)[:3]
for score, params in rows:
    print(f"score={score:.3f} params={params}")

Run it:

python tuning_lab.py

Expected output:

grid_search_lab
best_params= {'max_depth': 5, 'min_samples_leaf': 3, 'n_estimators': 80}
best_cv_f1=0.968
final accuracy=0.956 recall=0.972 f1=0.966
random_search_lab
best_params= {'n_estimators': 100, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 8}
best_cv_f1=0.972
top_3_grid_results
score=0.968 params={'max_depth': 5, 'min_samples_leaf': 3, 'n_estimators': 80}
score=0.968 params={'max_depth': 5, 'min_samples_leaf': 3, 'n_estimators': 160}
score=0.968 params={'max_depth': None, 'min_samples_leaf': 3, 'n_estimators': 160}

Parameters vs Hyperparameters

Random Forest learns split rules from data. Those learned split rules are parameters.

You choose settings such as:

n_estimators;
max_depth;
min_samples_leaf;
max_features.

Those are hyperparameters. They shape how learning happens.

Grid Search

Grid search tries every combination you list:

param_grid={
    "n_estimators": [80, 160],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3],
}

This grid has 2 x 3 x 2 = 12 combinations. With 5-fold CV, that means 60 model fits.

Use grid search when:

the search space is small;
you understand which values are plausible;
you want a simple, repeatable baseline.

Random Search

Random search samples a limited number of combinations from a larger space:

n_iter=8

In the lab, it tried only 8 combinations but searched a wider space and found a slightly higher CV F1:

best_cv_f1=0.972

Use random search when:

there are many hyperparameters;
training is expensive;
you want to explore before designing a narrower grid.

Final Holdout

The final holdout is the part we did not use in CV search:

X_train, X_final, y_train, y_final = train_test_split(...)

After the search chooses the best settings, we evaluate once:

final accuracy=0.956 recall=0.972 f1=0.966

Do not keep changing the grid after looking at the final holdout. If you do, it stops being a final holdout and becomes part of tuning.

Reading Search Results

The top grid results are very close:

score=0.968 params={'max_depth': 5, 'min_samples_leaf': 3, 'n_estimators': 80}
score=0.968 params={'max_depth': 5, 'min_samples_leaf': 3, 'n_estimators': 160}

When scores tie, prefer the simpler or cheaper model. More trees or deeper trees are not automatically better.

Practical Tuning Strategy

Stage	Action
Start	Build a simple baseline with default settings
Diagnose	Check bias/variance and metric choice
First search	Use a small grid around important parameters
Wider search	Use random search when combinations explode
Final check	Evaluate once on untouched holdout
Production	monitor drift and retrain policy

For experienced readers: Bayesian tools such as Optuna are useful when each trial is expensive or the search space is large. They are not a replacement for clean validation design.

Practical Debugging Checklist

Symptom	Likely cause	Fix
Search takes too long	grid too large	reduce candidates, use random search
CV score rises but final holdout drops	over-tuning	simplify search, keep a fresh holdout
Best model is much more complex	metric difference is tiny	choose cheaper/simpler model
Different runs choose different params	unstable data or small folds	use repeated CV or inspect variance
Tuning does not help	model class or features are limiting	improve features or model family first

Practice

Change scoring from "f1" to "recall". Which parameters change?
Add max_depth=10 to the grid. Does it improve CV score?
Increase n_iter from 8 to 16. Does random search improve enough to justify the cost?
Print mean_fit_time from cv_results_ and choose a cheaper model when scores tie.
Add a final untouched test set to one earlier lesson that currently uses only CV.

Reference implementation and walkthrough

Optimizing recall may choose more aggressive parameters that catch more positives, even if precision or F1 falls. The “best” parameters depend on the business cost encoded in the scoring metric.
max_depth=10 helps only if the previous grid was underfitting. If CV score does not improve or variance grows, the deeper option should not be selected.
Doubling n_iter is worthwhile only when the score gain is meaningful relative to runtime. Small gains within noise are not worth a much slower workflow.
mean_fit_time helps break ties. If two candidates score nearly the same, prefer the faster or simpler model unless the slower one has a clear operational benefit.
The final untouched test set should be used once after tuning. It estimates how the selected workflow performs on data that did not shape model or hyperparameter choices.

Evidence to Keep

Keep this page’s proof of learning as a small evidence card:

Evaluation Setup: split, cross-validation, metric, baseline, and comparison target
Result: score table, curve, confusion matrix, validation result, or search outcome
Decision: whether to change data, features, model, threshold, or hyperparameters
Failure Check: leakage, unstable validation, wrong metric, or tuning on the test set
Expected Output: evaluation record that supports a next modeling decision

Pass Check

You are done when you can explain:

hyperparameters are chosen before training;
grid search is exhaustive over a small listed space;
random search is useful when the space is larger;
final holdout must not be used for repeated tuning;
tuning cannot rescue bad features or the wrong validation design.