5 Introduction to Machine Learning: From Basics to Practice

[Figure: Main visual for machine learning]

Chapter 5 has one job: help you turn a data problem into a trainable, evaluable, improvable machine learning project.

See The Modeling Loop

[Figure: Main loop of machine learning modeling]

Read the picture first. Most reliable ML work follows this loop:

define task -> split data -> train baseline -> evaluate -> inspect errors -> improve

Start with a baseline before chasing model names. A baseline tells you whether later changes actually improve anything.

Learning Order And Task List

Use this table as both the chapter guide and the task sheet.

Page	Follow-along action	Evidence to keep
5.1 ML Basics	Identify classification, regression, clustering, anomaly detection, features, labels, train/test split, and sklearn flow	A problem-definition note
5.1.5 ML History	Optional background: skim how classic algorithms appeared	A short “why this algorithm exists” note
5.2 Supervised Learning	Run regression and classification examples before comparing many models	One baseline score and one improved score
5.3 Unsupervised Learning	Try clustering, dimensionality reduction, and anomaly detection when labels are missing	One chart or cluster interpretation
5.4 Evaluation	Choose metrics, use cross-validation, diagnose bias/variance, tune carefully	Metric choice and error samples
5.5 Feature Engineering	Handle missing values, categories, scaling, feature construction, feature selection, and Pipeline	Feature processing log and leakage check
5.6 Projects and 5.6.6 Workshop	Build a reproducible evidence pack before larger house-price, churn, segmentation, or Kaggle work	README, model comparison, errors, and next-step plan

Key terms for this chapter:

Term	Meaning
`feature`	Input column the model can use
`label` / `target`	Answer the model should learn to predict
`baseline`	Simplest model or rule you must beat
`metric`	Ruler for judging the model, such as F1, AUC, MAE, or RMSE
`leakage`	Test or target information accidentally entering training
`Pipeline`	Preprocessing and model packaged together to reduce leakage
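
To make the `Pipeline` and `leakage` entries concrete, here is a minimal sketch, assuming the built-in breast cancer dataset and a logistic regression model (both are illustration choices, not requirements of this chapter). Because the scaler sits inside the pipeline, cross-validation refits it on the training part of each fold, so validation rows never influence the scaling statistics.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X holds the features (input columns), y holds the label (0 or 1).
X, y = load_breast_cancer(return_X_y=True)

# Preprocessing and model are packaged together. During cross-validation
# the whole pipeline is refit per fold, so the scaler only ever sees
# that fold's training rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The metric is chosen explicitly instead of relying on the default.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")

print("F1 per fold:", scores.round(3))
print("Mean F1:", round(scores.mean(), 3))
```

Scaling the full dataset first and cross-validating afterwards would let every fold's validation rows shape the scaler, which is a small but real form of leakage.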

First Runnable Loop

Install scikit-learn (imported as `sklearn`) if needed:

python -m pip install scikit-learn

Then run this self-contained baseline. It uses a built-in dataset, splits data, trains a dummy baseline, trains a real model, and compares both.

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load features (X) and labels (y) from a built-in binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% for testing. stratify=y keeps the class ratio the same in both
# parts, and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline")
print(classification_report(y_test, baseline.predict(X_test), zero_division=0))

# Real model: scaling and logistic regression packaged in one pipeline,
# so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Logistic regression")
print(classification_report(y_test, model.predict(X_test), zero_division=0))

Expected shape:

Baseline
...
Logistic regression
...

Do not compare only the final scores. Ask: which classes are easy, which are hard, and which error would matter most in the real use case?
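
One way to start answering those questions is to look at the confusion matrix and the individual misclassified rows. The following is a sketch under the same assumptions as the baseline run (same dataset, split, and model); the variable names `cm`, `wrong`, and `proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
# In this dataset, 0 = malignant and 1 = benign.
cm = confusion_matrix(y_test, y_pred)
print(data.target_names)
print(cm)

# Positions (within the test set) of the rows the model got wrong,
# with the predicted probability of the "benign" class for each.
wrong = np.flatnonzero(y_pred != y_test)
proba = model.predict_proba(X_test)
for i in wrong:
    print(
        f"test row {i}: true={data.target_names[y_test[i]]}, "
        f"predicted={data.target_names[y_pred[i]]}, "
        f"P(benign)={proba[i, 1]:.2f}"
    )
```

For a screening task like this one, a malignant case predicted as benign is usually the costlier mistake, so the top-right cell of the matrix deserves more attention than overall accuracy.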

Depth Ladder

Level	What you can prove
Minimum pass	You can name the task type, split the data, train a baseline, and read the score.
Project-ready	You can explain why the chosen metric matches the goal, and show one error sample instead of trusting one score.
Deeper check	You can test for leakage, compare two feature choices, and say what would change in a real product or dataset update.
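
As an illustration of the "compare two feature choices" step, the sketch below scores the same pipeline on all 30 columns and on only the columns whose names start with "mean". The subset is an arbitrary choice made for this example; any two candidate feature sets could be compared the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Feature choice B: keep only the "mean ..." measurements (illustrative subset).
mean_cols = [i for i, name in enumerate(data.feature_names) if name.startswith("mean")]

def make_model():
    # A fresh pipeline per comparison so nothing is shared between runs.
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Same model, same folds, same metric; only the feature set changes.
score_all = cross_val_score(make_model(), X, y, cv=5, scoring="f1").mean()
score_mean = cross_val_score(make_model(), X[:, mean_cols], y, cv=5, scoring="f1").mean()

print(f"All {X.shape[1]} features: mean F1 = {score_all:.3f}")
print(f"{len(mean_cols)} 'mean' features: mean F1 = {score_mean:.3f}")
```

Keeping everything except the feature set fixed is what makes the comparison meaningful; a small difference between the two scores may be within fold-to-fold noise.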

Common Failures

Symptom	First thing to check	Usual fix
Score is strangely high	Leakage or wrong train/test split	Inspect features and split before training
Train score high, test score low	Overfitting	Simplify the model, regularize, or add data
All models are weak	Poor labels, weak features, or wrong metric	Inspect error samples and label definition
Accuracy looks fine but product risk is high	Class imbalance or costly false negatives	Use recall, precision, F1, AUC, or threshold review
Results cannot be reproduced	Random seed, data version, or dependency changed	Fix seeds and record versions
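
The "train score high, test score low" row can be checked directly by printing both scores side by side. This sketch assumes a decision tree as the example model, because an unconstrained tree overfits easily; `max_depth=3` is an arbitrary setting used to show one way of simplifying the model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# results maps max_depth -> (train accuracy, test accuracy).
results = {}
for depth in (None, 3):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, "
          f"test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

A large gap between the two numbers points to overfitting. Limiting depth often narrows the gap, though whether the test score itself improves depends on the data.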

Pass Check

Move to Chapter 6 when you can answer these five questions:

Is this task classification, regression, clustering, or anomaly detection?
What is the baseline, and what score must a real model beat?
Which metric matches the goal, and when is accuracy misleading?
How did you check for leakage?
What does the model do well, what does it do poorly, and what would you improve next?
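
For the leakage question, one simple screening idea is to score every feature on its own and look for any single column that predicts the label almost perfectly. The sketch below plants a hypothetical leaked column (a copy of the label, named `leaked_label_copy` here) to show what such a check would surface. It is an assumption-laden illustration, not a complete leakage audit.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Hypothetical leaked column: a copy of the label, as can happen when a
# post-outcome field slips into the feature table. Not part of the real dataset.
X_leaky = np.column_stack([X, y])
names = list(data.feature_names) + ["leaked_label_copy"]

# Score each feature alone with a one-split tree (a "stump").
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
solo_scores = {
    name: cross_val_score(stump, X_leaky[:, [j]], y, cv=5).mean()
    for j, name in enumerate(names)
}

# Show the five strongest single features; a near-perfect one needs a manual look.
top = sorted(solo_scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
for name, score in top:
    print(f"{name:25s} {score:.3f}")
```

A high single-feature score is not proof of leakage, since some honest features are simply strong. It is a prompt to ask when and how that column is produced relative to the label.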

For a printable checklist, use 5.0 Study Guide and Task Sheet. The next chapter moves from sklearn models into neural networks and deep learning training.
