5.3.3 Dimensionality Reduction

What You Will Build
Section titled “What You Will Build”This lesson uses the handwritten digits dataset to show:
- how PCA maps high-dimensional images into 2 dimensions;
- how explained variance changes when keeping 10, 20, or 40 components;
- how PCA affects classification accuracy;
- how reconstruction error drops as more components are kept;
- when PCA, t-SNE, and UMAP should be used differently.
Look at the maps first. Dimensionality reduction is not one tool with one purpose.


Keyword Decoder
Section titled “Keyword Decoder”| Term | Practical meaning |
|---|---|
dimension | One feature column, such as one pixel or one numeric field |
PCA | Principal Component Analysis; finds directions that keep as much variance as possible |
component | A new compressed feature created by PCA |
explained_variance_ratio_ | How much information-like variance each component keeps |
reconstruction | Approximate original data rebuilt from compressed components |
t-SNE | Visualization method for local neighborhood structure |
UMAP | Visualization and manifold method often used for embeddings |
python -m pip install -U scikit-learn numpyThe runnable lab uses only sklearn and NumPy. UMAP is useful in real projects, but it requires an extra package, so this beginner lab keeps the core dependency small.
Run the Complete Lab
Section titled “Run the Complete Lab”Create pca_lab.py:
import numpy as npfrom sklearn.datasets import load_digitsfrom sklearn.decomposition import PCAfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import accuracy_score, mean_squared_errorfrom sklearn.model_selection import train_test_splitfrom sklearn.pipeline import Pipelinefrom sklearn.preprocessing import StandardScaler
X, y = load_digits(return_X_y=True)X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25, random_state=42, stratify=y)
scaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train)X_test_scaled = scaler.transform(X_test)
print("pca_2d_map")pca2 = PCA(n_components=2, random_state=42)X_train_2d = pca2.fit_transform(X_train_scaled)print("shape=", X_train_2d.shape)print("explained_variance=", np.round(pca2.explained_variance_ratio_, 3).tolist())print("total_2d_variance=", round(float(pca2.explained_variance_ratio_.sum()), 3))
print("pca_modeling_lab")for n in [10, 20, 40]: model = Pipeline([ ("scale", StandardScaler()), ("pca", PCA(n_components=n, random_state=42)), ("clf", LogisticRegression(max_iter=5000, random_state=42)), ]) model.fit(X_train, y_train) pred = model.predict(X_test) pca = model.named_steps["pca"] print( f"components={n:<2} " f"variance={pca.explained_variance_ratio_.sum():.3f} " f"accuracy={accuracy_score(y_test, pred):.3f}" )
print("reconstruction_lab")for n in [10, 20, 40]: pca = PCA(n_components=n, random_state=42) compressed = pca.fit_transform(X_train_scaled) restored = pca.inverse_transform(compressed) mse = mean_squared_error(X_train_scaled, restored) print(f"components={n:<2} reconstruction_mse={mse:.3f}")Run it:
python pca_lab.pyExpected output:
pca_2d_mapshape= (1347, 2)explained_variance= [0.119, 0.097]total_2d_variance= 0.216pca_modeling_labcomponents=10 variance=0.591 accuracy=0.858components=20 variance=0.791 accuracy=0.942components=40 variance=0.953 accuracy=0.960reconstruction_labcomponents=10 reconstruction_mse=0.390components=20 reconstruction_mse=0.199components=40 reconstruction_mse=0.045
Read the 2D Result
Section titled “Read the 2D Result”The digits dataset has 64 pixel features. PCA with n_components=2 compresses each image into two numbers:
shape= (1347, 2)total_2d_variance= 0.216Two components are useful for plotting, but they keep only about 21.6% of the variance. That is fine for a quick map, but too little for a serious classifier.
Explained Variance
Section titled “Explained Variance”
Explained variance helps you decide how much information to keep:
components=10 variance=0.591 accuracy=0.858components=20 variance=0.791 accuracy=0.942components=40 variance=0.953 accuracy=0.960The useful lesson is not “always keep 95%.” The useful lesson is:
- if the goal is visualization,
2or3components may be enough; - if the goal is modeling, compare accuracy or the metric your project uses;
- if the goal is compression, compare reconstruction error and storage cost.
Reconstruction Error
Section titled “Reconstruction Error”Reconstruction asks: after compression, how much of the original data can be rebuilt?
components=10 reconstruction_mse=0.390components=40 reconstruction_mse=0.045More components make reconstruction better, but they also keep more dimensions. The right number is a trade-off between compactness and useful information.
PCA in a Model Pipeline
Section titled “PCA in a Model Pipeline”The modeling block uses:
Pipeline([ ("scale", StandardScaler()), ("pca", PCA(n_components=n, random_state=42)), ("clf", LogisticRegression(max_iter=5000, random_state=42)),])This order matters:
- split train/test first;
- fit scaling on training data only;
- fit PCA on training data only;
- train the model on compressed training features;
- evaluate on transformed test features.
Putting scaling and PCA in a pipeline helps prevent data leakage during cross-validation.
PCA, t-SNE, and UMAP
Section titled “PCA, t-SNE, and UMAP”| Method | Best use | Important warning |
|---|---|---|
| PCA | compression, preprocessing, fast 2D overview | linear method; may miss curved structure |
| t-SNE | local-neighborhood visualization | distances between far clusters can be misleading |
| UMAP | embedding visualization and neighborhood exploration | needs extra package; tune parameters and verify stability |
For beginners, the safest order is:
- Start with PCA because it is fast and interpretable.
- Use t-SNE or UMAP for visualization, not as your first production feature pipeline.
- If dimensionality reduction changes a model, validate with cross-validation.
Practical Debugging Checklist
Section titled “Practical Debugging Checklist”| Symptom | Likely cause | Fix |
|---|---|---|
| PCA result dominated by one feature | features are not scaled | use StandardScaler before PCA |
| 2D plot looks nice but model is weak | 2D kept too little variance | use more components for modeling |
| Accuracy drops sharply after PCA | too many useful features were discarded | increase n_components, compare with no-PCA baseline |
| Cross-validation score is too good | PCA fitted before split | put PCA inside Pipeline |
| t-SNE/UMAP plot looks overinterpreted | visualization layout is not a proof | check stability and downstream usefulness |
Practice
Section titled “Practice”- Change PCA components to
[5, 15, 30, 50]. Where does accuracy stop improving? - Run the classifier without PCA. Is PCA helping speed, accuracy, or only compression?
- Remove
StandardScaler. How does explained variance change? - Use
PCA(n_components=0.95)and print the number of components selected. - Use the 2D PCA output to draw a scatter plot colored by digit label.
Operation guide and checkpoints
- Accuracy often improves quickly at first and then plateaus. The practical answer is the smallest component count whose score is near the best score.
- PCA may help speed and storage even if accuracy is similar. If accuracy drops, the compressed representation is losing useful signal; if accuracy rises slightly, PCA may be removing noise.
- Without scaling, features with larger numeric ranges dominate the principal components. Explained variance may look high for the wrong reason.
PCA(n_components=0.95)selects the minimum number of components that preserve about 95% of variance. Report both the count and whether the downstream score remains acceptable.- A 2D PCA plot is a diagnostic, not proof of model quality. Overlapping colors suggest the classifier needs more dimensions or a nonlinear representation.
Evidence to Keep
Section titled “Evidence to Keep”Keep this page’s proof of learning as a small evidence card:
- Task
- clustering, dimensionality reduction, or anomaly detection goal
- Data View
- scaled features, projection, clusters, or anomaly scores
- Interpretation
- what the groups, axes, or alerts mean in the scenario
- Failure Check
- arbitrary cluster count, scaling issue, noisy dimension, or false alert
- Expected Output
- unsupervised result with interpretation and uncertainty note
Pass Check
Section titled “Pass Check”You are done when you can explain:
- PCA creates new compressed features called components;
- 2D PCA is useful for visualization but may discard too much information for modeling;
- explained variance is a guide, not an automatic target;
- PCA must be fitted inside the training pipeline;
- t-SNE and UMAP are mainly visualization tools unless you validate them carefully.