5.4.1 Evaluation Roadmap: Trust the Score Before Tuning
Model evaluation answers: is the model actually good, or did the score only look good by accident?
Look at the Evaluation Map First

| Topic | First question |
|---|---|
| metrics | what score matches the task? |
| cross-validation | is the score stable across splits? |
| bias-variance | is the model too simple or too flexible? |
| tuning | which parameter change is actually better? |
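The first row of the map, picking a score that matches the task, can be illustrated with a small sketch. The labels below are made up for illustration (95 negatives, 5 positives), and the "model" is assumed to predict the majority class every time. Accuracy looks strong while recall shows that every positive case was missed.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical, hand-made labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A lazy "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.95
print(f"recall: {recall:.2f}")      # recall: 0.00
```

If missing a positive case is the costly mistake, recall is the score to watch here, not accuracy.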
Run One Cross-Validation Check
Create evaluation_first_loop.py and run it after installing scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the iris features and labels.
X, y = load_iris(return_X_y=True)

# A shallow tree, scored on 5 cross-validation folds.
model = DecisionTreeClassifier(max_depth=2, random_state=42)
scores = cross_val_score(model, X, y, cv=5)

print("fold_scores:", [float(round(score, 3)) for score in scores])
print("mean_accuracy:", round(scores.mean(), 3))
```

Expected output:

```
fold_scores: [0.933, 0.967, 0.9, 0.867, 1.0]
mean_accuracy: 0.933
```

One score is a snapshot. Several folds tell you whether the result is stable enough to trust.
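One way to read the fold scores is to look at their spread and to compare them with a trivial baseline. The sketch below assumes the same iris setup and uses scikit-learn's `DummyClassifier` as the baseline. The baseline choice and the variable names are illustrative assumptions, not part of the original script.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(max_depth=2, random_state=42)
# Assumed baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")

model_scores = cross_val_score(model, X, y, cv=5)
baseline_scores = cross_val_score(baseline, X, y, cv=5)

# The mean says how good the score is; the standard deviation says how much it moves between folds.
print(f"model:    mean={model_scores.mean():.3f} std={model_scores.std():.3f}")
print(f"baseline: mean={baseline_scores.mean():.3f} std={baseline_scores.std():.3f}")
```

A model whose mean barely beats the baseline, or whose spread is as large as its lead, has not yet earned tuning effort.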
Learn in This Order

| Order | Read | What to practice |
|---|---|---|
| 1 | 5.4.2 Evaluation Metrics | accuracy, precision, recall, F1, R², RMSE |
| 2 | 5.4.3 Cross-Validation | stable estimates, data split risk |
| 3 | 5.4.4 Bias and Variance | underfitting, overfitting, learning curves |
| 4 | 5.4.5 Hyperparameter Tuning | grid search, comparison records |
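As a preview of step 3, underfitting and overfitting can be spotted by comparing training and validation scores. The sketch below assumes the iris data again and uses `cross_validate` with `return_train_score=True`. The depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

train_means = {}
valid_means = {}
# max_depth=None lets the tree grow until its leaves are pure.
for depth in [1, 2, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    result = cross_validate(model, X, y, cv=5, return_train_score=True)
    train_means[depth] = result["train_score"].mean()
    valid_means[depth] = result["test_score"].mean()
    print(
        f"max_depth={depth}: "
        f"train={train_means[depth]:.3f} validation={valid_means[depth]:.3f}"
    )
```

A depth-1 tree has only two leaves, so it can predict at most two of the three classes and scores low on both training and validation data (underfitting). A wide gap between a high training score and a lower validation score points the other way, toward overfitting.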
Evidence to Keep
Keep this page’s proof of learning as a small evidence card:
- Evaluation Setup
  - split, cross-validation, metric, baseline, and comparison target
- Result
  - score table, curve, confusion matrix, validation result, or search outcome
- Decision
  - whether to change data, features, model, threshold, or hyperparameters
- Failure Check
  - leakage, unstable validation, wrong metric, or tuning on the test set
- Expected Output
  - evaluation record that supports a next modeling decision
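The card can be kept as plain structured data. Below is a minimal sketch that fills one from a cross-validation run on iris. The dictionary keys and the `DummyClassifier` baseline are hypothetical choices for illustration, not a required format.

```python
import json

from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2, random_state=42), X, y, cv=5
)
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5
)

# Hypothetical field names that mirror the card sections above.
evidence_card = {
    "evaluation_setup": {
        "validation": "5-fold cross-validation",
        "metric": "accuracy",
        "baseline": "most_frequent dummy classifier",
    },
    "result": {
        "fold_scores": [round(float(s), 3) for s in model_scores],
        "mean_accuracy": round(float(model_scores.mean()), 3),
        "baseline_mean_accuracy": round(float(baseline_scores.mean()), 3),
    },
    "decision": "fill in after reviewing the result",
    "failure_check": "no test set used for tuning; folds checked for stability",
}

print(json.dumps(evidence_card, indent=2))
```

Writing the record as JSON makes it easy to store next to the script and compare against later runs.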
Pass Check
You pass this roadmap when you can choose a metric for the task, explain one score stability check, and avoid tuning before the evaluation method is trustworthy.
Check reasoning and explanation
- Choose the metric from the task goal and mistake cost before tuning the model.
- Cross-validation answers whether the score is stable across splits; one lucky split is not enough evidence.
- Do not tune on the final test set. Keep a comparison record that states the baseline, metric, validation method, result, and next decision.
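The last point can be sketched as follows: hold out a test set first, tune with cross-validation on the training portion only, and score the test set once at the end. The split size and the `max_depth` grid are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Set the final test set aside before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Tune only on the training portion, using cross-validation inside it.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [1, 2, 3, 4]},  # illustrative grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best_params:", search.best_params_)
print("cv_accuracy:", round(float(search.best_score_), 3))

# Touch the test set once, after the choice is made.
test_accuracy = search.score(X_test, y_test)
print("test_accuracy:", round(float(test_accuracy), 3))
```

If the test score is used to pick between settings, it stops being an honest estimate. The cross-validated score is the one to compare during tuning.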