
5.5.6 Pipeline and Workflows

ColumnTransformer and Pipeline workflow diagram

  • Master ColumnTransformer for handling mixed data types
  • Learn how to create custom Transformers
  • Build a complete feature engineering pipeline

Many beginners can do each step separately, but once they get to a real project, things get messy. Pipeline addresses one question:

How can we turn “data processing -> feature engineering -> model training” into a stable, reproducible, leak-free workflow?

flowchart LR
A["Raw data"] --> B["Split by column"]
B --> C["Numeric processing"]
B --> D["Categorical processing"]
B --> E["Custom features"]
C --> F["Merge"]
D --> F
E --> F
F --> G["Model training / prediction"]

You can think of a Pipeline as:

  • Putting scattered manual steps into an automatic assembly line

Without a Pipeline, you may end up doing this:

  • Manually filling missing values
  • Manually encoding categories
  • Manually scaling features
  • Manually feeding the result into the model

This is like:

  • Writing the process on paper every time, which makes it very easy to miss a step

The value of a Pipeline is:

  • Fixes the workflow so training and prediction always follow the same rules
  • Avoids inconsistent preprocessing between training and test sets
  • Avoids data leakage
  • Makes cross-validation and hyperparameter tuning easier
  • Lets you reuse the whole workflow on new data

When are you most likely to make mistakes?

The most common mistake is:

  • Preprocessing the training set manually one way
  • Preprocessing the test set manually in a different way

As a result, the model does not see the same kind of data at all. The most important role of Pipeline is to prevent this kind of “the workflow drifted, but you didn’t notice” problem.
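
A minimal sketch of how a Pipeline keeps training and prediction on the same rules. The toy `age` values and the labels are made up for illustration, and `LogisticRegression` is just a stand-in model. The point is that the median learned from the training data is the one reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one numeric column with a gap in train and in test.
train = pd.DataFrame({"age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0]})
y_train = [0, 1, 1, 1, 0, 1]
test = pd.DataFrame({"age": [np.nan, 80.0]})

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(train, y_train)  # every step learns its rules from train only

# The imputer stores the training median and reuses it at prediction time.
learned_median = pipe.named_steps["imputer"].statistics_[0]

# Slicing the pipeline runs only the imputer on the test rows.
imputed_test = pipe[:1].transform(test)

predictions = pipe.predict(test)  # the same steps are applied automatically
print(learned_median, imputed_test.ravel(), predictions)
```

Because the test set never goes through `fit`, its own median cannot influence how its missing value is filled.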

ColumnTransformer — process columns separately

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

df = sns.load_dataset('titanic').dropna(subset=['embarked'])

# Define features
num_features = ['age', 'fare']
cat_features = ['sex', 'embarked', 'class']

# Numeric preprocessing pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical preprocessing pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False)),
])

# Combine
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
])

X = df[num_features + cat_features]
y = df['survived']
X_transformed = preprocessor.fit_transform(X)
print(f"Raw: {X.shape} → After processing: {X_transformed.shape}")

What is the most important thing to notice in this example?

The most important thing to notice is:

  • Different columns should follow different processing paths

That means:

  • Numeric columns should not be processed with categorical encoding methods
  • Categorical columns should not be directly standardized

When many beginners first work with tabular data, the problem is often not the model choice, but that the column processing strategy is mixed up from the start.


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Preprocessing + model in one step
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='accuracy')
print(f"5-fold CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

Why do Pipeline and cross-validation work so well together?

Because the essence of cross-validation is:

  • The model is retrained from scratch on every fold

If your preprocessing is written inside a Pipeline, then for each fold it will automatically:

  • Fit only on the training fold
  • Apply the same rules to the validation fold

This is exactly the key to preventing data leakage.
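
The per-fold behaviour can be verified directly. The following sketch uses synthetic data and a `StandardScaler` plus `LogisticRegression` pipeline as stand-ins. It asks `cross_validate` to return the fitted pipeline for each fold and compares the scaler's learned mean with the mean of that fold's training rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric features and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(50, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=50) > 10).astype(int)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# A fixed random_state makes the splitter yield the same folds every time.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
result = cross_validate(demo_pipe, X_demo, y_demo, cv=cv, return_estimator=True)

fold_checks = []
for (train_idx, _), fitted in zip(cv.split(X_demo), result["estimator"]):
    scaler_mean = fitted.named_steps["scaler"].mean_
    # The scaler should have seen only this fold's training rows.
    fold_checks.append(bool(np.allclose(scaler_mean, X_demo[train_idx].mean(axis=0))))

print(fold_checks)
```

If the scaler had been fitted once on all rows before cross-validation, every fold would share the same mean and the validation rows would have contributed to it.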

ColumnTransformer Pipeline diagram for real tabular data

This diagram breaks a real tabular-data project into three paths: numeric columns are imputed first and then scaled, categorical columns are imputed first and then encoded, and custom features must also be included in the same Pipeline. Finally, the whole process is handed over to cross-validation or GridSearch, so training, validation, and prediction all follow the same reproducible workflow.


from sklearn.base import BaseEstimator, TransformerMixin

class FamilySizeTransformer(BaseEstimator, TransformerMixin):
    """Construct family size features from sibsp and parch"""

    def fit(self, X, y=None):
        # Nothing to learn: the features are computed row by row
        return self

    def transform(self, X):
        X = X.copy()
        X['family_size'] = X['sibsp'] + X['parch'] + 1
        X['is_alone'] = (X['family_size'] == 1).astype(int)
        return X[['family_size', 'is_alone']]

# Use it
custom_features = ['sibsp', 'parch']
full_preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
    ('custom', FamilySizeTransformer(), custom_features),
])

pipe = Pipeline([
    ('preprocessor', full_preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(pipe, df[num_features + cat_features + custom_features], y, cv=5)
print(f"With custom features: {scores.mean():.4f} ± {scores.std():.4f}")

A custom Transformer is most useful when:

  • You already know that a feature construction method is valuable
  • And it needs to be reliably reused during both training and prediction

For example:

  • Family size
  • Whether someone is alone
  • Area per room

In these cases, writing it as a Transformer is much more stable than copying code around in different places.
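
As an illustration of the "area per room" case, here is a minimal sketch of a stateless Transformer. The class name `AreaPerRoomTransformer` and the column names `area` and `rooms` are hypothetical, chosen only for this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class AreaPerRoomTransformer(BaseEstimator, TransformerMixin):
    """Divide a hypothetical `area` column by a hypothetical `rooms` column."""

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the training data.
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        out["area_per_room"] = X["area"] / X["rooms"]
        return out

# Made-up rows to show the result.
houses = pd.DataFrame({"area": [90.0, 120.0, 45.0], "rooms": [3, 4, 1]})
features = AreaPerRoomTransformer().fit_transform(houses)
print(features)
```

Because the logic lives in one class, the same calculation runs during training, cross-validation, and prediction without being copied by hand.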


from sklearn.model_selection import GridSearchCV

# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, None],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(df[num_features + cat_features + custom_features], y)
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.4f}")

The minimum pipeline beginners should master first

If you are just starting with ML projects, at least learn to write the following chain smoothly:

  1. Missing-value imputation
  2. Numeric scaling
  3. Categorical encoding
  4. Model training
  5. Cross-validation

Once you can write this chain fluently, adding more complex features and tuning hyperparameters will become much easier.

flowchart LR
D["Raw data"] --> CT["ColumnTransformer"]
CT --> N["Numeric: impute → standardize"]
CT --> C["Categorical: impute → encode"]
CT --> CU["Custom: feature construction"]
N --> M["Merge"]
C --> M
CU --> M
M --> CL["Classifier"]
CL --> R["Prediction"]
style CT fill:#e3f2fd,stroke:#1565c0,color:#333
style CL fill:#fff3e0,stroke:#e65100,color:#333
style R fill:#e8f5e9,stroke:#2e7d32,color:#333

A workflow checklist beginners can copy directly

When you build a tabular-data project for the first time, the safest checklist is usually:

  1. Is missing-value handling included in the Pipeline?
  2. Are numeric columns and categorical columns clearly separated?
  3. Is cross-validation running on the complete Pipeline?
  4. Are custom features flowing through training and prediction together?

If you get all 4 of these right, your project is already much more stable than many versions that “run, but cannot be reproduced.”


Keep this page’s proof of learning as a small evidence card:

  • Feature State: raw columns, types, missing values, scale, and target relationship
  • Transformation: preprocessing, construction, selection, or pipeline step
  • Output: transformed feature table, pipeline object, score change, or selected features
  • Failure Check: leakage, inconsistent train/test transform, high-cardinality trap, or meaningless feature
  • Expected Output: feature pipeline evidence with before/after and metric impact

Component               Description
Pipeline                Chains multiple steps together
ColumnTransformer       Applies different processing to different columns
Custom Transformer      Inherits from BaseEstimator + TransformerMixin
Pipeline + GridSearch   Tunes preprocessing and model together

Exercise 1: Build a complete Pipeline (including numeric processing, categorical encoding, and custom features), and compare the performance of Random Forest and Logistic Regression.

Exercise 2: Pipeline hyperparameter tuning

Use GridSearchCV on the Pipeline from Exercise 1 to tune both preprocessing parameters (such as PCA n_components) and model parameters at the same time.

Solution approach and explanation
  1. A complete Titanic Pipeline should include imputation, numeric scaling where needed, categorical encoding, custom features, and the estimator. The same object should handle both training and prediction.
  2. RandomForest is largely insensitive to feature scaling and can capture nonlinear splits; LogisticRegression usually benefits more from scaling and cleaner encoding. Compare them with the same validation protocol.
  3. GridSearchCV should tune preprocessing and model parameters together because choices interact. For example, dimensionality reduction or feature selection can change which model settings work best.
  4. The final deliverable should include the fitted pipeline, CV scores, chosen parameters, and one leakage check showing that preprocessing happens inside the pipeline.
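
For point 2, "the same validation protocol" can be sketched as one shared splitter and one shared preprocessing template, with only the final step swapped. The data below comes from `make_classification` as a stand-in for the Titanic features, and the `candidates` dictionary is an illustrative choice, not a prescribed setup.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X_cmp, y_cmp = make_classification(n_samples=200, n_features=8, random_state=0)

base = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# One splitter with a fixed seed, so both models see identical folds.
shared_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # Copy the template and replace only the final step.
    candidate_pipe = clone(base).set_params(classifier=estimator)
    results[name] = cross_val_score(
        candidate_pipe, X_cmp, y_cmp, cv=shared_cv, scoring="accuracy"
    )

for name, fold_scores in results.items():
    print(f"{name}: {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
```

Keeping the folds and preprocessing identical means any score difference comes from the estimator rather than from a different split.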