
5.4.1 Evaluation Roadmap: Trust the Score Before Tuning

Model evaluation answers: is the model actually good, or did the score only look good by accident?

Model Evaluation Learning Map

Chapter Flow for Model Evaluation

| Topic | First question |
| --- | --- |
| metrics | what score matches the task? |
| cross-validation | is the score stable across splits? |
| bias-variance | is the model too simple or too flexible? |
| tuning | which parameter change is actually better? |

Create evaluation_first_loop.py and run it after installing scikit-learn.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the iris features and labels.
X, y = load_iris(return_X_y=True)

# A shallow tree keeps the first loop simple; random_state makes it repeatable.
model = DecisionTreeClassifier(max_depth=2, random_state=42)

# Score the model on 5 folds (accuracy is the default metric for classifiers).
scores = cross_val_score(model, X, y, cv=5)

print("fold_scores:", [float(round(score, 3)) for score in scores])
print("mean_accuracy:", round(scores.mean(), 3))

Expected output:

fold_scores: [0.933, 0.967, 0.9, 0.867, 1.0]
mean_accuracy: 0.933

One score is a snapshot. Several folds tell you whether the result is stable enough to trust.
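To see why a single score is only a snapshot, the sketch below scores the same model on several different single train/test splits. The seed range, the `test_size` of 0.2, and the names `single_split_scores` and `spread` are assumptions chosen for illustration, not part of the original script.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_split_scores = []
for seed in range(5):  # arbitrary seeds, each one produces a different split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = DecisionTreeClassifier(max_depth=2, random_state=42)
    model.fit(X_train, y_train)
    # Accuracy on this one held-out split only.
    single_split_scores.append(float(model.score(X_test, y_test)))

# The gap between the best and worst split shows how much one split can mislead.
spread = max(single_split_scores) - min(single_split_scores)
print("single_split_scores:", [round(s, 3) for s in single_split_scores])
print("spread:", round(spread, 3))
```

If the spread is large relative to the improvement you are hoping to measure, a single split is not enough evidence on its own.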

| Order | Read | What to practice |
| --- | --- | --- |
| 1 | 5.4.2 Evaluation Metrics | accuracy, precision, recall, F1, R², RMSE |
| 2 | 5.4.3 Cross-Validation | stable estimates, data split risk |
| 3 | 5.4.4 Bias and Variance | underfitting, overfitting, learning curves |
| 4 | 5.4.5 Hyperparameter Tuning | grid search, comparison records |
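As a preview of the bias and variance step in the reading order, the following sketch compares training and validation accuracy for trees of different depth. The depths `(1, 2, 10)` and the name `depth_report` are assumptions picked for illustration; the later chapter may use a different approach such as learning curves.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

depth_report = {}
for depth in (1, 2, 10):  # illustrative depths: very simple, moderate, very flexible
    result = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=42),
        X,
        y,
        cv=5,
        return_train_score=True,  # also record accuracy on the training folds
    )
    depth_report[depth] = {
        "train": float(result["train_score"].mean()),
        "validation": float(result["test_score"].mean()),
    }
    print(
        f"max_depth={depth}",
        f"train={depth_report[depth]['train']:.3f}",
        f"validation={depth_report[depth]['validation']:.3f}",
    )
```

Low scores on both columns suggest a model that is too simple; a training score well above the validation score suggests one that is too flexible.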

Keep this page’s proof of learning as a small evidence card:

Evaluation Setup: split, cross-validation, metric, baseline, and comparison target
Result: score table, curve, confusion matrix, validation result, or search outcome
Decision: whether to change data, features, model, threshold, or hyperparameters
Failure Check: leakage, unstable validation, wrong metric, or tuning on the test set
Expected Output: evaluation record that supports a next modeling decision
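One possible way to keep the card is as a plain dictionary filled from a cross-validated run. This is a minimal sketch: the `evidence_card` structure, its key names, and the wording of the decision and failure check are hypothetical, and `DummyClassifier` is used here as an assumed baseline.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Baseline: always predict the most frequent class seen in training.
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2, random_state=42), X, y, cv=5
)

# Hypothetical record layout that mirrors the card fields.
evidence_card = {
    "evaluation_setup": {
        "validation": "5-fold cross-validation",
        "metric": "accuracy",
        "baseline": "DummyClassifier(strategy='most_frequent')",
    },
    "result": {
        "baseline_mean_accuracy": round(float(baseline_scores.mean()), 3),
        "model_mean_accuracy": round(float(model_scores.mean()), 3),
        "model_fold_scores": [round(float(s), 3) for s in model_scores],
    },
    # Illustrative wording only; the real decision depends on your task.
    "decision": "model beats baseline; compare one parameter change next",
    "failure_check": "no test set used during this comparison",
}

for section, content in evidence_card.items():
    print(section, "->", content)
```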

You pass this roadmap when you can choose a metric for the task, explain one score stability check, and avoid tuning before the evaluation method is trustworthy.
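Choosing a metric for the task matters because two models can share the same accuracy while making very different mistakes. The labels below are hand-made for illustration (they are not from any dataset in this chapter): `pred_a` never predicts the positive class, `pred_b` catches every positive at the cost of two false alarms.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hand-made, imbalanced labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses both positives
pred_b = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # finds both positives, 2 false alarms

accuracy_a = accuracy_score(y_true, pred_a)
accuracy_b = accuracy_score(y_true, pred_b)
recall_a = recall_score(y_true, pred_a)
recall_b = recall_score(y_true, pred_b)

print(f"pred_a accuracy={accuracy_a:.2f} recall={recall_a:.2f}")
# pred_a accuracy=0.80 recall=0.00
print(f"pred_b accuracy={accuracy_b:.2f} recall={recall_b:.2f}")
# pred_b accuracy=0.80 recall=1.00
```

If missing a positive case is the costly mistake, accuracy alone cannot separate these two predictors, while recall can.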

Check reasoning and explanation
  1. Choose the metric from the task goal and mistake cost before tuning the model.
  2. Cross-validation answers whether the score is stable across splits; one lucky split is not enough evidence.
  3. Do not tune on the final test set. Keep a comparison record that states the baseline, metric, validation method, result, and next decision.
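The third point can be sketched as follows: hold out a test set first, run the search with cross-validation on the training portion only, and score the test set once at the end. The parameter grid, the 80/20 split, and the name `final_test_accuracy` are assumptions for illustration rather than recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Set the final test set aside before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Tune with cross-validation on the training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [1, 2, 3, 4]},  # illustrative grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best_params:", search.best_params_)
print("cv_accuracy:", round(float(search.best_score_), 3))

# Touch the test set once, after the choice has been made.
final_test_accuracy = float(search.score(X_test, y_test))
print("final_test_accuracy:", round(final_test_accuracy, 3))
```

The cross-validated score guides the choice of `max_depth`; the test score is only reported, never used to go back and adjust the grid.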