5.2.5 Ensemble Learning: Forest, Boosting, Stacking

Ensemble learning combines several models so one model’s weakness is less likely to dominate the final prediction. For tabular data, this is often the strongest classic ML family.
Look at the Three Main Paths
Do not memorize every model name first. Separate the three main paths:
- Bagging, such as Random Forest: many models train in parallel and vote. Use it when you want stability and lower variance. Watch out for larger, harder-to-explain models.
- Boosting, such as GBDT, XGBoost, LightGBM, and CatBoost: each new model focuses on previous errors. Use it when tabular accuracy matters. Control depth, learning rate, and early stopping to avoid overfitting.
- Stacking, such as StackingClassifier: base model predictions feed a meta-model. Use it when different model families have complementary strengths. Build it with cross-validation to avoid leakage.
Run the Comparison Lab
Create ch05_ensemble_lab.py.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data,
        data.target,
        test_size=0.25,
        random_state=42,
        stratify=data.target,
    )

    models = {
        "single_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
        "random_forest": RandomForestClassifier(
            n_estimators=200,
            max_depth=6,
            random_state=42,
        ),
        "gradient_boost": GradientBoostingClassifier(
            n_estimators=120,
            learning_rate=0.05,
            max_depth=2,
            random_state=42,
        ),
    }

    models["stacking_cv"] = StackingClassifier(
        estimators=[
            ("rf", models["random_forest"]),
            ("gb", models["gradient_boost"]),
            ("lr", make_pipeline(
                StandardScaler(),
                LogisticRegression(max_iter=2000, random_state=42),
            )),
        ],
        final_estimator=LogisticRegression(max_iter=2000, random_state=42),
        cv=5,
    )

    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(f"{name:<15} accuracy={accuracy_score(y_test, pred):.3f} f1={f1_score(y_test, pred):.3f}")

    rf = models["random_forest"]
    importances = rf.feature_importances_
    top = importances.argsort()[-3:][::-1]
    print("top_rf_features=")
    for idx in top:
        print(f"- {data.feature_names[idx]}: {importances[idx]:.3f}")

Run:

    python ch05_ensemble_lab.py

Expected output:

    single_tree     accuracy=0.944 f1=0.956
    random_forest   accuracy=0.958 f1=0.967
    gradient_boost  accuracy=0.944 f1=0.956
    stacking_cv     accuracy=0.986 f1=0.989
    top_rf_features=
    - worst perimeter: 0.146
    - worst area: 0.140
    - worst concave points: 0.109
Small score changes across sklearn versions are acceptable. Keep the comparison table and top features as project evidence.
Read the Result
The single tree is the baseline. Random Forest usually improves stability by averaging many different trees.
Boosting is not automatically better on every small dataset. It needs careful control of tree depth, learning rate, and number of trees, with each change checked against validation performance.
Stacking can win here because it combines different model families, but it must use cross-validation. Training the meta-model on predictions made on the same rows used to fit the base models leaks information.
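A single 75/25 split can flatter or punish any one model, so a repeated check is a useful companion to the lab. The sketch below is an illustration rather than part of the lab script: it scores the first three models with 5-fold stratified cross-validation. The names `candidates` and `cv_summary` are made up for this example, and the forest uses 100 trees only to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical name: the same model families as the lab, minus stacking.
candidates = {
    "single_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=6, random_state=42
    ),
    "gradient_boost": GradientBoostingClassifier(
        n_estimators=120, learning_rate=0.05, max_depth=2, random_state=42
    ),
}

# Five stratified folds give five scores per model instead of one.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name:<15} f1_mean={scores.mean():.3f} f1_std={scores.std():.3f}")
```

If two models differ by less than their fold-to-fold standard deviation, treat them as tied and prefer the simpler one.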
Bagging: Random Forest
Random Forest trains many decision trees on randomized views of the data and averages/votes their predictions.
Good first settings:
| Parameter | What it controls | Beginner default |
|---|---|---|
| n_estimators | number of trees | 100 to 300 |
| max_depth | tree depth | start small, then increase |
| min_samples_leaf | minimum samples in a leaf | increase if overfitting |
| random_state | reproducibility | always set it while learning |
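The settings in the table can be tried in a few lines. This is a minimal sketch, not a tuned configuration: the specific values (200 trees, depth 4, leaf size 2) are assumptions picked from the suggested ranges. It also turns on `oob_score=True`, which is an addition not covered above; it asks sklearn to score each tree on the bootstrap rows that tree did not sample, giving a rough validation estimate without a separate split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=4,           # start small, then increase
    min_samples_leaf=2,    # raise this if the forest overfits
    oob_score=True,        # assumption: extra check, relies on the default bootstrap=True
    random_state=42,       # reproducibility
)
forest.fit(X_train, y_train)

print(f"oob_accuracy={forest.oob_score_:.3f}")
print(f"test_accuracy={forest.score(X_test, y_test):.3f}")
```

If the out-of-bag accuracy sits well below the training accuracy, increase min_samples_leaf or lower max_depth before adding more trees.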
Boosting: GBDT and Toolkits
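Boosting builds models in sequence: each new model is trained to correct the errors left by the models before it, and the final prediction adds up all of their contributions.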
In sklearn, start with GradientBoostingClassifier or HistGradientBoostingClassifier. In real tabular projects, XGBoost, LightGBM, and CatBoost are common external libraries, but do not add them before the sklearn baseline is clear.
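To see the sequential idea without any library magic, here is a deliberately simplified sketch of boosting for a regression problem with squared error. It is not how sklearn implements GradientBoostingClassifier; the synthetic sine data, the 50 rounds, and the names `prediction` and `mse_history` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (assumption: a noisy sine curve).
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Round 0: predict the mean everywhere.
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    residual = y - prediction                  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                      # the new tree targets those errors
    prediction += learning_rate * tree.predict(X)  # take a small corrective step
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"train_mse_start={mse_history[0]:.4f} train_mse_end={mse_history[-1]:.4f}")
```

Training error keeps falling as rounds are added, which is exactly why boosting needs a validation signal: the training curve alone never tells you when to stop.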

First tuning order for boosting:
| Step | Change | Why |
|---|---|---|
| 1 | learning_rate and n_estimators | controls step size and training length |
| 2 | max_depth / leaf settings | controls model complexity |
| 3 | validation or early stopping | stops overfitting |
| 4 | feature preprocessing | improves signal quality |
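Steps 1 to 3 of the tuning order can be combined in one run. The sketch below uses GradientBoostingClassifier's built-in early stopping through `validation_fraction` and `n_iter_no_change`. The specific values (a 500-tree ceiling, 20% held out, patience of 10 rounds) are assumptions for illustration, not recommended defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

booster = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on rounds, not a target
    learning_rate=0.05,        # step size
    max_depth=2,               # model complexity
    validation_fraction=0.2,   # part of the training rows held out internally
    n_iter_no_change=10,       # stop after 10 rounds without validation improvement
    random_state=42,
)
booster.fit(X_train, y_train)

# n_estimators_ reports how many trees were kept after early stopping.
print(f"trees_used={booster.n_estimators_} of {booster.n_estimators}")
print(f"test_accuracy={booster.score(X_test, y_test):.3f}")
```

Setting a generous ceiling and letting validation decide the stopping point is usually easier than guessing n_estimators by hand.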
Stacking Safely
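Stacking is powerful only if the meta-model is trained on out-of-fold predictions, meaning each training row's base-model prediction comes from a model that did not see that row during fitting.
Use sklearn’s StackingClassifier(cv=5) instead of manually reusing predictions from the training rows.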
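The difference between leaky and out-of-fold predictions is easy to measure. The following sketch is a hand-rolled illustration with a single base model, not a replacement for StackingClassifier; the variable names (`oof_feature`, `leaky_feature`, `meta`) are hypothetical. It uses `cross_val_predict` to build the out-of-fold feature and compares it with predictions the base model makes on its own training rows.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

base = RandomForestClassifier(n_estimators=100, random_state=42)

# Safe: each training row is predicted by a clone that never saw that row.
oof_feature = cross_val_predict(
    base, X_train, y_train, cv=5, method="predict_proba"
)[:, 1]

# Leaky: the fitted base model predicts the same rows it was trained on.
base.fit(X_train, y_train)
leaky_feature = base.predict_proba(X_train)[:, 1]

leaky_acc = ((leaky_feature > 0.5) == y_train).mean()
oof_acc = ((oof_feature > 0.5) == y_train).mean()
print(f"leaky_train_accuracy={leaky_acc:.3f} oof_train_accuracy={oof_acc:.3f}")

# The meta-model learns from the honest feature; at test time the base model
# fit on all training rows supplies the input.
meta = LogisticRegression()
meta.fit(oof_feature.reshape(-1, 1), y_train)
test_feature = base.predict_proba(X_test)[:, 1].reshape(-1, 1)
print(f"stacked_test_accuracy={meta.score(test_feature, y_test):.3f}")
```

The leaky feature looks far more reliable on the training rows than it will be on new data, so a meta-model trained on it learns to trust the base model too much.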
Choosing a Model
| Situation | Start with |
|---|---|
| need a strong, stable baseline | Random Forest |
| tabular data with many nonlinear patterns | Gradient Boosting / XGBoost / LightGBM |
| categorical-heavy tabular data | CatBoost, after baseline |
| several model families perform differently | Stacking with cross-validation |
| need easiest explanation | shallow tree or Random Forest feature importance |
Evidence to Keep
Keep this page’s proof of learning as a small evidence card:
- Task: regression or classification problem with target definition
- Model: linear/logistic/tree/ensemble/SVM configuration and train/test split
- Metric: regression error, accuracy/F1, threshold curve, or confusion matrix
- Failure Check: overfitting, underfitting, feature scaling, threshold choice, or class imbalance
- Expected Output: model result plus error samples or residual review
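One way to fill in such a card is to generate it from the run itself, so the numbers cannot drift from the code. This is only a sketch: the dictionary layout and key names are hypothetical, chosen to mirror the five items above, and the train accuracy stands in for a fuller overfitting check.

```python
import json

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42, stratify=data.target
)
model = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Hypothetical layout: one key per item on the evidence card.
evidence_card = {
    "task": "binary classification on sklearn load_breast_cancer target",
    "model": "RandomForestClassifier(n_estimators=200, max_depth=6), 75/25 stratified split",
    "metric": {
        "accuracy": round(float(accuracy_score(y_test, pred)), 3),
        "f1": round(float(f1_score(y_test, pred)), 3),
        "confusion_matrix": confusion_matrix(y_test, pred).tolist(),
    },
    "failure_check": {
        # A large gap between train and test accuracy hints at overfitting.
        "train_accuracy": round(float(model.score(X_train, y_train)), 3),
    },
    "expected_output": {
        # Positions in the test split that the model got wrong, for manual review.
        "misclassified_test_rows": [int(i) for i in (pred != y_test).nonzero()[0]],
    },
}
print(json.dumps(evidence_card, indent=2))
```

Writing the printed JSON to a file next to the script keeps the card and the code that produced it together.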
Common Failures
| Symptom | First check | Usual fix |
|---|---|---|
| ensemble barely beats one tree | features are weak or split is unstable | add features, use cross-validation |
| train score high, test score low | overfitting | lower depth, increase leaf size, add validation |
| boosting gets worse as trees increase | too many rounds | reduce learning rate or use early stopping |
| stacking looks unrealistically perfect | leakage | use out-of-fold predictions or StackingClassifier(cv=...) |
| feature importance is read too literally | correlated features | validate with permutation importance or ablation |
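For the last row, sklearn ships `permutation_importance` in `sklearn.inspection`. The sketch below compares the forest's built-in impurity importances with permutation importances measured on the test split. The choice of five repeats and a top-5 cut is an assumption made to keep the example quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)
rf.fit(X_train, y_train)

# Shuffle one feature at a time on held-out rows and measure the accuracy drop.
result = permutation_importance(
    rf, X_test, y_test, n_repeats=5, random_state=42, scoring="accuracy"
)

impurity_top = rf.feature_importances_.argsort()[-5:][::-1]
permutation_top = result.importances_mean.argsort()[-5:][::-1]

print("impurity_top=")
for idx in impurity_top:
    print(f"- {data.feature_names[idx]}: {rf.feature_importances_[idx]:.3f}")
print("permutation_top=")
for idx in permutation_top:
    print(f"- {data.feature_names[idx]}: {result.importances_mean[idx]:.3f}")
```

When several features carry nearly the same information, shuffling any one of them may barely change the score, so small permutation values for correlated features do not prove those features are useless.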
Practice
- Change Random Forest max_depth from 6 to 3 and None.
- Change Gradient Boosting learning_rate from 0.05 to 0.2.
- Remove cv=5 from your mental model and explain why stacking would leak without cross-validation.
- Save a model comparison table and one paragraph explaining which model you would ship first.
Reference implementation and walkthrough
- max_depth=3 limits each tree and can reduce overfitting. None allows deeper trees, often improving training score while risking worse validation behavior.
- A higher boosting learning rate learns faster but can overshoot or overfit. Check validation score, not just training score.
- Stacking leaks when the meta-model learns from base-model predictions on rows those base models already trained on. Cross-validation creates out-of-fold predictions that are closer to real deployment.
- A shipping decision should name the model, validation metric, complexity, failure risk, and monitoring plan. The best first model is often the simplest one that meets the target metric reliably.
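The first two practice items can be run as a small sweep. This is one possible sketch, reusing the lab's split and model settings; the result dictionaries `depth_results` and `lr_results` are hypothetical names. It prints train and test accuracy side by side so the gap between them is visible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Practice item 1: Random Forest depth.
depth_results = {}
for depth in (3, 6, None):
    rf = RandomForestClassifier(n_estimators=200, max_depth=depth, random_state=42)
    rf.fit(X_train, y_train)
    depth_results[depth] = (rf.score(X_train, y_train), rf.score(X_test, y_test))
    train_acc, test_acc = depth_results[depth]
    print(f"rf max_depth={str(depth):<5} train={train_acc:.3f} test={test_acc:.3f}")

# Practice item 2: Gradient Boosting learning rate.
lr_results = {}
for lr in (0.05, 0.2):
    gb = GradientBoostingClassifier(
        n_estimators=120, learning_rate=lr, max_depth=2, random_state=42
    )
    gb.fit(X_train, y_train)
    lr_results[lr] = (gb.score(X_train, y_train), gb.score(X_test, y_test))
    train_acc, test_acc = lr_results[lr]
    print(f"gb learning_rate={lr:<5} train={train_acc:.3f} test={test_acc:.3f}")
```

A setting that raises the training score without moving the test score is adding complexity, not skill.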
Pass Check
You are ready to continue when you can explain:
- the difference between Bagging and Boosting;
- why Random Forest is usually safer than one tree;
- why Boosting needs validation control;
- why Stacking must use cross-validation;
- why the best leaderboard score is not always the best production choice.