5.1.3 Scikit-learn Follow-Along: Fit, Transform, Pipeline

Scikit-learn Estimator and Pipeline diagram

Scikit-learn is the standard Python library for classic machine learning. This page is intentionally short: first see the workflow, then run one complete script.

Unified sklearn fit-predict workflow

Most sklearn work follows the same loop:

load data -> split train/test -> fit on train -> predict on test -> score -> save evidence

The four verbs to remember:

| Verb | Meaning | Common object |
| --- | --- | --- |
| fit | learn parameters from training data | estimator or transformer |
| transform | apply learned preprocessing | transformer |
| predict | produce labels or numbers | estimator |
| score | return a quick metric | estimator or pipeline |
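
The four verbs can be tried on a few rows of data. The sketch below is a minimal illustration, not part of the lesson script: the six-row dataset is made up, and it scores on the training rows only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up toy data (an assumption for illustration): one feature, two classes.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

scaler = StandardScaler()
scaler.fit(X)                      # fit: learn the mean and standard deviation
X_scaled = scaler.transform(X)     # transform: apply what was learned

clf = LogisticRegression()
clf.fit(X_scaled, y)               # fit: learn the model coefficients
labels = clf.predict(X_scaled)     # predict: produce class labels
accuracy = clf.score(X_scaled, y)  # score: mean accuracy for a classifier

print("learned mean:", scaler.mean_[0])  # 6.5
print("labels:", labels)
print("accuracy:", accuracy)
```

Note that fit appears twice: once on the transformer and once on the estimator. A Pipeline chains those two calls for you.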

sklearn Pipeline component breakdown

| Role | Job | Example |
| --- | --- | --- |
| Estimator | learn and predict | LogisticRegression, DecisionTreeClassifier |
| Transformer | change data shape, scale, or representation | StandardScaler, OneHotEncoder, PCA |
| Pipeline | connect preprocessing and model into one reusable workflow | scaler -> classifier |

The beginner rule: fit preprocessing only on the training set. A Pipeline helps you follow that rule automatically.
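
One way to see the rule in action is to inspect the scaler inside a fitted Pipeline. This is a small sketch under the same split settings the lesson script uses; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)  # only training rows are passed in

fitted_scaler = pipe.named_steps["scale"]
rows_seen = int(fitted_scaler.n_samples_seen_)
matches_train_mean = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))

print("rows seen by scaler:", rows_seen)                # 112
print("mean matches training mean:", matches_train_mean)  # True
```

The scaler never receives the 38 test rows during fit, because the Pipeline only passes along what you hand to pipe.fit.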

python -m pip install --upgrade scikit-learn joblib
python - <<'PY'
import sklearn
print(sklearn.__version__)
PY

Expected output is a version number such as:

1.8.0

scikit-learn is the package name you install. sklearn is the module name you import.

Create ch05_sklearn_workflow.py.

from pathlib import Path

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hold out 25% of the rows for testing; stratify keeps class ratios equal.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.25,
    random_state=42,
    stratify=iris.target,
)

# Each candidate is a Pipeline, so all models share the same fit/predict API.
models = {
    "logistic": Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, random_state=42)),
    ]),
    "tree": Pipeline([
        ("model", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ]),
    "knn": Pipeline([
        ("scale", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=5)),
    ]),
}

# Fit on the training split, score on the test split.
scores = {}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    scores[name] = accuracy_score(y_test, pred)
    print(f"{name:<8} accuracy={scores[name]:.3f}")

best_name = max(scores, key=scores.get)
best_model = models[best_name]
print(f"best={best_name}")
print("first_prediction=", iris.target_names[best_model.predict(X_test[:1])][0])

print("report_for_best:")
print(classification_report(
    y_test,
    best_model.predict(X_test),
    target_names=iris.target_names,
    zero_division=0,
))

# Save the whole pipeline (scaler + model), then prove it still predicts.
output_path = Path("iris_pipeline.joblib")
dump(best_model, output_path)
reloaded = load(output_path)
print("reloaded_prediction=", iris.target_names[reloaded.predict(X_test[:1])][0])

Run it:

python ch05_sklearn_workflow.py

Expected output:

logistic accuracy=0.921
tree     accuracy=0.895
knn      accuracy=0.921
best=logistic
first_prediction= setosa
report_for_best:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       0.86      0.92      0.89        13
   virginica       0.92      0.85      0.88        13

    accuracy                           0.92        38
   macro avg       0.92      0.92      0.92        38
weighted avg       0.92      0.92      0.92        38

reloaded_prediction= setosa

sklearn lab result interpretation map

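Exact scores may shift slightly between sklearn versions. When two models tie, as logistic and knn do here, max() returns the first tied key in dictionary order, so the reported best model depends on the order of the models dict. That is fine. The important evidence is: every model fits, predicts, and scores, and the saved pipeline can still predict after reload.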
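
The tie between logistic and knn can be checked in isolation. The sketch below copies the rounded scores from the expected output above into a plain dict; Python's built-in max returns the first maximal key it meets, and dicts iterate in insertion order.

```python
# Rounded scores copied from the expected output (illustrative values).
scores = {"logistic": 0.921, "tree": 0.895, "knn": 0.921}

best_name = max(scores, key=scores.get)
top_score = max(scores.values())
tied = [name for name, value in scores.items() if value == top_score]

print("best:", best_name)  # logistic
print("tied:", tied)       # ['logistic', 'knn']
```

If the tie matters for your notes, printing the tied names like this makes the decision explicit instead of relying on dictionary order.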

StandardScaler fit versus transform comic

Wrong workflow:

fit scaler on all data -> split -> evaluate

Why wrong: the test set already influenced preprocessing, so the score is too optimistic.

Correct workflow:

split -> fit scaler on training data -> transform test data -> evaluate

Using Pipeline([("scale", StandardScaler()), ("model", ...)]) keeps that order for both training and prediction.
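
The two workflows can be compared side by side. This is a minimal sketch, assuming the same iris split as the lesson script: it fits one scaler on all rows (the wrong order), one on training rows only (the correct order), and checks that the manual correct workflow and the Pipeline produce the same predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Wrong order: the scaler sees every row, including future test rows.
leaky_scaler = StandardScaler().fit(X)

# Correct order: split first, then fit the scaler on training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# The same steps expressed as a Pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=42)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

same_predictions = np.array_equal(manual_pred, pipe_pred)
stats_differ = not np.allclose(leaky_scaler.mean_, scaler.mean_)
print("manual == pipeline:", same_predictions)       # True
print("leaky scaler statistics differ:", stats_differ)  # True
```

On a small, clean dataset such as iris the leak probably changes the score very little. The habit matters more on real data, where preprocessing statistics can carry a lot of information about the test rows.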

Keep this page’s proof of learning as a small evidence card:

  • ML Problem: supervised, unsupervised, evaluation, or feature-engineering task
  • Baseline: simplest sklearn/modeling loop and fixed train/test split
  • Output: prediction, metric, chart, or model decision note
  • Failure Check: data leakage, unclear target, weak baseline, or metric mismatch
  • Expected Output: minimal ML loop with metric and one failure observation


| Symptom | First check | Usual fix |
| --- | --- | --- |
| ModuleNotFoundError: No module named 'sklearn' | active Python environment | install with python -m pip install scikit-learn |
| score changes every run | missing random_state | set random_state=42 for the split and for models that support it |
| great test score, poor real result | data leakage | use Pipeline, split before fitting preprocessing |
| cannot save or load model | missing joblib or wrong path | install joblib, print Path.cwd() |
| model comparison feels unfair | different preprocessing paths | put each model inside a comparable Pipeline |

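
The first checks for the environment and path rows can be run as a short diagnostic. This is a sketch using only the standard library; the printed labels are illustrative.

```python
import importlib.util
import sys
from pathlib import Path

# Which interpreter is running? pip must install into this same one.
print("python:", sys.executable)

# Where will relative paths such as iris_pipeline.joblib be resolved?
print("cwd:", Path.cwd())

# find_spec returns None when a top-level module cannot be imported.
sklearn_found = importlib.util.find_spec("sklearn") is not None
joblib_found = importlib.util.find_spec("joblib") is not None
print("sklearn importable:", sklearn_found)
print("joblib importable:", joblib_found)
```

If sklearn is reported as not importable, install it with the same interpreter shown on the first line, for example by running that interpreter with -m pip install scikit-learn.
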
  1. Change test_size from 0.25 to 0.2 and record the score change.
  2. Change KNeighborsClassifier(n_neighbors=5) to n_neighbors=3.
  3. Add one more model, such as SVC, using the same Pipeline pattern.
  4. Save the terminal output and iris_pipeline.joblib as your evidence.
Operation guide and checkpoints
  1. A smaller test split leaves more training examples but fewer evaluation examples, so the score may move slightly and become a little less stable.
  2. n_neighbors=3 makes KNN more local and flexible. Accuracy can improve if the class boundary is sharp, or drop if the model starts reacting to noise.
  3. The extra model should use the same split and preprocessing path, for example Pipeline([("scale", StandardScaler()), ("model", SVC())]), so the comparison is fair.
  4. Good evidence includes the command output, model settings, score, and saved .joblib file. The point is to prove that the full fit/evaluate/save loop ran.
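
For checkpoint 3, one possible shape of the extra model is sketched below. It assumes the same split settings as the lesson script and SVC with its default parameters; the name svc_pipe is illustrative, and the exact accuracy you see may differ by version.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()

# Same split as the main script, so the score is comparable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.25,
    random_state=42,
    stratify=iris.target,
)

# Same Pipeline pattern: scale first, then the model.
svc_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", SVC()),
])
svc_pipe.fit(X_train, y_train)
svc_accuracy = accuracy_score(y_test, svc_pipe.predict(X_test))
print(f"{'svc':<8} accuracy={svc_accuracy:.3f}")
```

In the full script the simpler route is to add an "svc" entry to the models dict, so the existing loop fits and scores it alongside the other three.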

You are ready for the next lesson when you can explain:

  • what fit, transform, predict, and score do;
  • why preprocessing must learn from training data only;
  • why Pipeline is safer than manual preprocessing;
  • how to compare two models with the same train/test split;
  • how to save and reload the final model.