5.4.3 Cross-Validation

Figure: K-Fold cross-validation split diagram

Section Overview

A single train-test split is a snapshot. Cross-validation gives you a more stable estimate by testing the model across several different validation folds.

What You Will Build

This lesson shows:

  • why one train-test split can be noisy;
  • how to use StratifiedKFold for classification;
  • how to evaluate several metrics with cross_validate;
  • why preprocessing must stay inside Pipeline;
  • when random K-Fold is wrong, especially for time series.

Figure: Cross-validation stable evaluation flowchart

Setup

python -m pip install -U scikit-learn numpy

Run the Complete Lab

Create cv_lab.py:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


X, y = load_breast_cancer(return_X_y=True)

# Part 1: score the same model on five different random splits.
print("single_split_variance")
for seed in [1, 2, 3, 4, 5]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=2000, random_state=42)),
    ])
    model.fit(X_train, y_train)
    print(f"seed={seed} accuracy={accuracy_score(y_test, model.predict(X_test)):.3f}")

# Part 2: 5-fold stratified cross-validation, several metrics at once.
print("cross_validation_lab")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, random_state=42)),
])
result = cross_validate(
    model,
    X,
    y,
    cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)
for i, score in enumerate(result["test_accuracy"], start=1):
    print(f"fold={i} accuracy={score:.3f}")
print(
    "summary "
    f"accuracy={np.mean(result['test_accuracy']):.3f}+/-{np.std(result['test_accuracy']):.3f} "
    f"precision={np.mean(result['test_precision']):.3f} "
    f"recall={np.mean(result['test_recall']):.3f} "
    f"f1={np.mean(result['test_f1']):.3f}"
)

Run it:

python cv_lab.py

Expected output:

single_split_variance
seed=1 accuracy=0.965
seed=2 accuracy=0.972
seed=3 accuracy=0.986
seed=4 accuracy=0.972
seed=5 accuracy=0.979
cross_validation_lab
fold=1 accuracy=0.974
fold=2 accuracy=0.947
fold=3 accuracy=0.965
fold=4 accuracy=0.991
fold=5 accuracy=0.991
summary accuracy=0.974+/-0.017 precision=0.968 recall=0.992 f1=0.979

Figure: Cross-validation lab result map

Why One Split Is Not Enough

The same model gets different scores with different random splits:

seed=1 accuracy=0.965
seed=3 accuracy=0.986

Neither number is fake. They are just different snapshots. Cross-validation asks: "Across several snapshots, what is the average performance and how much does it vary?"
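To see how large that noise is, compute the gap between the best and worst run. A tiny sketch, reusing the scores printed above:

seed_scores = [0.965, 0.972, 0.986, 0.972, 0.979]  # the five single-split runs above
print(f"spread={max(seed_scores) - min(seed_scores):.3f}")  # 0.021: two accuracy points from the seed alone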

Stratified K-Fold

For classification, use StratifiedKFold first. It keeps the class ratio similar in each fold, which is especially important for imbalanced datasets.

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Use K=5 as a practical default:

  • less noisy than one split;
  • cheaper than 10-fold on large data;
  • easy to explain to teammates.
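A quick way to confirm what stratification preserves is to inspect the positive-class ratio in each validation fold. A minimal sketch, assuming the same dataset as the lab:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(f"overall positive ratio={y.mean():.3f}")
for i, (train_idx, val_idx) in enumerate(cv.split(X, y), start=1):
    # Each fold's validation ratio should sit very close to the overall ratio.
    print(f"fold={i} val positive ratio={y[val_idx].mean():.3f}")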

Use a Leakage-Safe Pipeline

Figure: Cross-validation leakage-safe pipeline diagram

This is the safe pattern:

Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, random_state=42)),
])

During cross-validation, each fold must fit its scaler only on that fold's training portion. If you scale all data before CV, information from validation folds leaks into training.
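To make the leak concrete, compare scaling before cross-validation against scaling inside the pipeline. This is a contrast sketch, not part of the lab; on this dataset the score gap may be tiny, but the mechanism is what matters:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Unsafe: the scaler's mean and std are computed from every row,
# including rows that later land in validation folds.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=2000), X_leaky, y, cv=cv)

# Safe: the whole pipeline is refit per fold, so each scaler
# sees only that fold's training portion.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])
safe = cross_val_score(pipe, X, y, cv=cv)
print(f"leaky={leaky.mean():.3f} safe={safe.mean():.3f}")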

Read Mean and Variance

The summary is more useful than one fold:

summary accuracy=0.974+/-0.017 precision=0.968 recall=0.992 f1=0.979

Read it as:

  • average accuracy is about 0.974;
  • fold-to-fold variation is about 0.017;
  • recall is very high, which matters if missing positives is costly.

If standard deviation is large, the model may be unstable, the dataset may be small, or some folds may contain harder cases.
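A short sketch (reusing the fold accuracies printed above) reproduces the summary line and points at the weakest fold, which is the first place to look when variance is high:

import numpy as np

scores = np.array([0.974, 0.947, 0.965, 0.991, 0.991])  # fold accuracies from the lab
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")  # matches the summary line
print(f"weakest fold={scores.argmin() + 1} accuracy={scores.min():.3f}")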

When K-Fold Is Wrong

Do not use randomly shuffled folds when:

  • the data is time series;
  • rows from the same user/session/device can appear in both train and validation;
  • examples are grouped by patient, customer, document, or experiment;
  • future information would leak into the past.

Use a split that matches the real deployment situation: TimeSeriesSplit, group splits, or a chronological holdout.
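Both alternatives ship with scikit-learn. A minimal sketch on toy arrays (the arrays and group labels are illustrative assumptions):

import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
y = np.zeros(10)

# TimeSeriesSplit: training indices always come before validation indices.
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(f"train={train_idx} val={val_idx}")

# GroupKFold: all rows for one user/group land on one side of the split.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
for train_idx, val_idx in GroupKFold(n_splits=2).split(X, y, groups=groups):
    print(f"train groups={set(groups[train_idx])} val groups={set(groups[val_idx])}")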

Practical Choice Guide

Each entry maps a situation to the splitter to reach for:

  • Basic classification: StratifiedKFold(n_splits=5, shuffle=True)
  • Regression: KFold(n_splits=5, shuffle=True)
  • Time series: TimeSeriesSplit or chronological validation
  • Same entity appears many times: group-aware splitting
  • Hyperparameter tuning: nested CV or a final untouched test set

For experienced readers: after model selection, keep one final holdout set or production-like backtest that was not used during tuning.
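One way to implement that advice: carve the holdout off before any tuning, cross-validate on the rest, and score the holdout exactly once. A sketch, with the 20% size and seed as illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_work, X_hold, y_work, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])
# All model selection and tuning happens on the working set only ...
print(f"cv accuracy={cross_val_score(model, X_work, y_work, cv=5).mean():.3f}")
# ... and the holdout is touched exactly once, at the very end.
model.fit(X_work, y_work)
print(f"holdout accuracy={model.score(X_hold, y_hold):.3f}")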

Practical Debugging Checklist

Each entry lists a symptom, its likely cause, and a fix:

  • CV score much higher than test score: leakage or over-tuning. Put preprocessing in the pipeline and keep a final holdout.
  • Fold scores vary wildly: small data or hard segments. Inspect fold composition and per-segment metrics.
  • A classification fold has no positives: non-stratified split. Use StratifiedKFold.
  • A time-series model looks too good: future data leaked. Validate chronologically.
  • CV takes too long: too many folds or a heavy model. Use fewer folds or a faster baseline first.

Practice

  1. Change n_splits to 3 and 10. How do mean and standard deviation change?
  2. Remove stratify=y from the single split. Does the score become less stable?
  3. Add roc_auc to the scoring list.
  4. Move StandardScaler() outside the pipeline intentionally, then explain why that is unsafe.
  5. Design a validation split for user events where each user has many rows.

Pass Check

You are done when you can explain:

  • one train-test split is only one snapshot;
  • K-Fold estimates average performance and variability;
  • classification should usually use stratified folds;
  • preprocessing must be inside the pipeline;
  • validation strategy must match deployment data flow.