
5.4.2 Evaluation Metrics

[Figure: Confusion matrix and error cost diagram]

Section Overview

Metrics are not report-card decorations. They decide which model you trust, which threshold you ship, and which mistake your product is willing to pay for.

What You Will Build

This lesson gives you one evaluation lab:

  • expose the accuracy trap on imbalanced classification;
  • tune thresholds and read false positives/false negatives;
  • compare ROC AUC and average precision;
  • evaluate regression with MAE, RMSE, and R2;
  • choose metrics from product cost, not from habit.

Start with the map:

[Figure: Evaluation metric selection flowchart]

Keyword Decoder

Term | Practical meaning
TP | true positive: real positive and predicted positive
FP | false positive: real negative but predicted positive
FN | false negative: real positive but missed
precision | among predicted positives, how many were really positive
recall | among real positives, how many were found
F1 | harmonic mean of precision and recall
ROC AUC | ranking quality across many thresholds; can look optimistic on rare positives
average_precision | area under the precision-recall curve; often better for imbalanced positive classes
MAE | average absolute regression error
RMSE | root mean squared error; punishes large misses more
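
If you prefer formulas, precision, recall, and F1 are simple arithmetic on the confusion-matrix counts. A minimal sketch with made-up counts (the numbers are illustrative only, not taken from the lab):

tp, fp, fn = 18, 4, 27  # illustrative counts

precision = tp / (tp + fp)  # among predicted positives, how many were really positive
recall = tp / (tp + fn)     # among real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")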

Setup

python -m pip install -U scikit-learn

Run the Complete Lab

Create metrics_lab.py:

from sklearn.datasets import load_diabetes, make_classification
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# Imbalanced binary classification data: roughly 8% positives.
X, y = make_classification(
    n_samples=1200,
    n_features=12,
    n_informative=5,
    n_redundant=2,
    weights=[0.92, 0.08],
    class_sep=1.2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Majority-class baseline: high accuracy, zero recall.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
base_pred = baseline.predict(X_test)
print("classification_baseline")
print(f"accuracy={accuracy_score(y_test, base_pred):.3f}")
print(f"precision={precision_score(y_test, base_pred, zero_division=0):.3f}")
print(f"recall={recall_score(y_test, base_pred):.3f}")

# Real model: scaled features + logistic regression.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, random_state=42)),
])
model.fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]

# Sweep the decision threshold and watch precision and recall trade off.
print("threshold_lab")
for threshold in [0.2, 0.5, 0.8]:
    pred = (prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print(
        f"threshold={threshold:.1f} "
        f"accuracy={accuracy_score(y_test, pred):.3f} "
        f"precision={precision_score(y_test, pred, zero_division=0):.3f} "
        f"recall={recall_score(y_test, pred):.3f} "
        f"f1={f1_score(y_test, pred):.3f} "
        f"fp={fp} fn={fn}"
    )
print(f"roc_auc={roc_auc_score(y_test, prob):.3f}")
print(f"average_precision={average_precision_score(y_test, prob):.3f}")

# Regression lab: mean baseline vs linear model on the diabetes dataset.
print("regression_lab")
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
for name, reg in [
    ("mean_baseline", DummyRegressor(strategy="mean")),
    ("linear", LinearRegression()),
]:
    reg.fit(X_train, y_train)
    pred = reg.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(
        f"{name:<13} "
        f"mae={mean_absolute_error(y_test, pred):.1f} "
        f"rmse={rmse:.1f} "
        f"r2={r2_score(y_test, pred):.3f}"
    )

Run it:

python metrics_lab.py

Expected output:

classification_baseline
accuracy=0.917
precision=0.000
recall=0.000
threshold_lab
threshold=0.2 accuracy=0.907 precision=0.462 recall=0.720 f1=0.562 fp=21 fn=7
threshold=0.5 accuracy=0.943 precision=0.833 recall=0.400 f1=0.541 fp=2 fn=15
threshold=0.8 accuracy=0.923 precision=1.000 recall=0.080 f1=0.148 fp=0 fn=23
roc_auc=0.889
average_precision=0.660
regression_lab
mean_baseline mae=65.5 rmse=74.9 r2=-0.014
linear mae=41.5 rmse=53.4 r2=0.485

[Figure: Evaluation metrics threshold and regression result map]

The Accuracy Trap

The baseline predicts the majority class every time:

accuracy=0.917
precision=0.000
recall=0.000

That looks like a high accuracy score, but it finds zero positive cases. For imbalanced classification, accuracy alone can be actively misleading.
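
The 0.917 is no accident: with a stratified split of data that is about 92% negative, a model that always answers "negative" scores the negative share by construction. A quick check, assuming y_test from metrics_lab.py is in scope:

# Accuracy of the always-negative baseline equals the share of negatives in the test set.
negative_share = (y_test == 0).mean()
print(f"negative_share={negative_share:.3f}")  # matches the baseline accuracy above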

Confusion Matrix First

Every classification metric comes from four counts:

Count | Meaning
TP | positive case correctly found
FP | normal case incorrectly flagged
FN | positive case missed
TN | normal case correctly ignored
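
Every headline metric is arithmetic on these four counts. As a sanity check, you can recompute them inside the threshold loop of metrics_lab.py; a minimal sketch using the tn, fp, fn, tp already unpacked there:

# Derive the headline metrics directly from the confusion matrix counts.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"manual accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")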

Before choosing a metric, ask:

  • Is FP more expensive, or is FN more expensive?
  • Is the model used for screening, ranking, blocking, or final decision?
  • How many cases can humans review?

Thresholds Change the Story

[Figure: Guide to reading thresholds, ROC, and PR curves]

The same model gives different behavior at different thresholds:

threshold=0.2 precision=0.462 recall=0.720 fp=21 fn=7
threshold=0.8 precision=1.000 recall=0.080 fp=0 fn=23

Lowering the threshold catches more positives but creates more false alarms. Raising it creates fewer false alarms but misses more positives.
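
In practice you usually fix one side of the trade-off and optimize the other. One common pattern is to set a recall target and take the highest threshold that still meets it; a sketch reusing y_test and prob from the lab, with an arbitrary 0.70 target:

from sklearn.metrics import precision_recall_curve

target_recall = 0.70  # arbitrary example target, not part of the lab
precisions, recalls, thresholds = precision_recall_curve(y_test, prob)

# thresholds[i] pairs with precisions[i] and recalls[i]; the final point has no threshold.
candidates = [
    (t, p, r)
    for t, p, r in zip(thresholds, precisions[:-1], recalls[:-1])
    if r >= target_recall
]
if candidates:
    best_t, best_p, best_r = max(candidates)  # highest threshold that still meets the target
    print(f"threshold={best_t:.2f} precision={best_p:.3f} recall={best_r:.3f}")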

Use this guide:

Product goal | Primary metric
catch as many positives as possible | recall
keep alerts trustworthy | precision
balance precision and recall | F1
rank candidates across thresholds | ROC AUC
rare positive class | average precision / PR curve

ROC AUC vs Average Precision

roc_auc=0.889 says the model ranks positives above negatives fairly well across thresholds.

average_precision=0.660 is stricter for rare positives because it focuses on precision-recall behavior. In fraud, medical screening, security alerts, and churn rescue, always inspect precision-recall metrics, not only ROC AUC.
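
One reason for the gap is that the two metrics have different no-skill baselines: a random scorer sits near 0.5 ROC AUC regardless of imbalance, while its average precision lands roughly at the positive prevalence (about 0.08 here). A quick demonstration you can append to the lab, assuming y_test and the metric imports from metrics_lab.py are in scope:

import numpy as np

rng = np.random.default_rng(0)
random_scores = rng.random(len(y_test))  # a scorer with no skill at all

print(f"random roc_auc={roc_auc_score(y_test, random_scores):.3f}")                      # near 0.5
print(f"random average_precision={average_precision_score(y_test, random_scores):.3f}")  # near the positive rate
print(f"positive_rate={y_test.mean():.3f}")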

Regression Metrics

[Figure: Regression metrics and residual diagnosis comic]

The regression lab compares a mean baseline with a linear model:

mean_baseline mae=65.5 rmse=74.9 r2=-0.014
linear mae=41.5 rmse=53.4 r2=0.485

Read them like this:

Metric | Use it when
MAE | you want average error in the original unit
RMSE | large errors are especially painful
R2 | you want to know how much better the model is than a mean baseline
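
To feel the difference between MAE and RMSE, compare two error patterns with the same average absolute miss; a toy sketch with made-up residuals:

import numpy as np

spread_out = np.array([10.0, 10.0, 10.0, 10.0])  # four moderate misses
one_big = np.array([0.0, 0.0, 0.0, 40.0])        # one large miss, same MAE

for name, errors in [("spread_out", spread_out), ("one_big", one_big)]:
    mae = np.abs(errors).mean()
    rmse = np.sqrt((errors ** 2).mean())
    print(f"{name:<10} mae={mae:.1f} rmse={rmse:.1f}")  # same MAE, very different RMSE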

Do not rely only on R2. A model can have a decent R2 while still making unacceptable errors for important cases.

Practical Metric Selection

Task | Start with | Then check
Balanced classification | accuracy, F1 | confusion matrix
Imbalanced classification | precision, recall, F1 | PR curve, threshold table
Screening / detection | recall | alert volume and false positives
Final approval / blocking | precision | missed positives and manual review policy
Ranking | ROC AUC, average precision | top-k precision
Regression | MAE, RMSE | residual plots and segment errors

For experienced readers: evaluate by segment. A global metric can hide failures on a region, customer group, language, device type, or rare class.
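
A minimal sketch of the idea, with a hypothetical segment column and synthetic labels (none of this comes from metrics_lab.py):

import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical data: per-row segment labels alongside true labels and predictions.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "segment": rng.choice(["web", "mobile", "partner"], size=300),
    "y_true": rng.integers(0, 2, size=300),
    "y_pred": rng.integers(0, 2, size=300),
})

# A global recall can hide a weak segment; always look at the breakdown.
for segment, group in frame.groupby("segment"):
    score = recall_score(group["y_true"], group["y_pred"], zero_division=0)
    print(f"{segment:<8} n={len(group):<4} recall={score:.3f}")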

Practical Debugging Checklist

Symptom | Likely cause | Fix
High accuracy, zero recall | class imbalance | use confusion matrix and recall
Good ROC AUC, poor alerts | threshold too high or rare positives | inspect PR curve and alert volume
F1 improves but product worsens | metric does not match business cost | define FP/FN cost explicitly
Regression average looks fine | large errors hidden in a segment | inspect residuals by segment
Offline metric drops in production | distribution shift | monitor data and metric drift

Practice

  1. Change class weights to [0.98, 0.02]. What happens to accuracy and recall?
  2. Add thresholds [0.1, 0.3, 0.7, 0.9]. Which threshold would you ship for screening?
  3. Print tp, fp, fn, tn for every threshold.
  4. Add a tree model and compare ROC AUC and average precision.
  5. For regression, print the five largest absolute errors and inspect the inputs.

Pass Check

You are done when you can explain:

  • accuracy can be misleading on imbalanced data;
  • precision and recall describe different error costs;
  • threshold choice is part of product design;
  • ROC AUC and PR metrics answer different questions;
  • regression metrics need residual and segment checks.