5.6.3 Project 2: Customer Churn Prediction (Classification Problem)

Customer churn prediction is a classic business ML application. This project focuses on three skills: handling imbalanced data, understanding business metrics, and extracting business insights from model results.
Project Overview
| Information | Description |
|---|---|
| Task type | Binary classification (churn / retain) |
| Core challenge | Imbalanced data (far fewer churn customers than retained customers) |
| Evaluation metrics | F1, AUC, recall |
| Skills involved | Imbalance handling, Pipeline, business analysis |
Key Terms Before You Read the Code
- Recall asks: among customers who really churned, how many did we catch? It matters when missing a churn customer is expensive.
- Precision asks: among customers we flagged as high-risk, how many truly churned? It matters when retention actions are costly.
- F1 is the harmonic mean of precision and recall. It is useful when you need one balanced number, but it hides the business trade-off.
- ROC (Receiver Operating Characteristic) shows how recall changes as the false-positive rate changes across thresholds.
- AUC (Area Under the Curve) summarizes the ROC curve into one number. A higher AUC means the model ranks churn customers ahead of retained customers more reliably.
- SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority-class samples. It can help imbalanced data, but it must be applied only inside the training split or cross-validation fold.
- class_weight tells the model to penalize minority-class mistakes more heavily without creating synthetic samples.
- Threshold is the probability cutoff for predicting churn. Moving it lower usually increases recall but also increases false alarms, as the sketch below shows.
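To make the threshold idea concrete, here is a minimal sketch on made-up toy labels and probabilities (the numbers below are illustrative only, not from this project's data):
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels (1 = churned) and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.4, 0.7, 0.9])

# Moving the cutoff lower trades precision for recall
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")
Lowering the cutoff from 0.5 to 0.3 catches all three churners in this toy set but flags more retained customers, which is exactly the trade-off the terms above describe.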
First, let’s set an important learning expectation
With this problem, it is very easy for beginners to jump straight into "model comparison":
- Logistic Regression
- Random Forest
- SMOTE
- AUC
But for the first round, what is actually more worth practicing is not who gets the highest score, but:
How do you truly connect business cost, metric selection, threshold decisions, and model results when facing an imbalanced classification problem?
As long as this line is clear first, this exercise will feel like a real project instead of “just another classification problem.”
First, build a map
The most valuable part of this problem is not "build a binary classifier," but facing these challenges for the first time:
- Imbalanced data
- Threshold selection
- Different business costs
So what this problem really trains is “how to make classification decisions,” not simply “how to run a classification model.”
What you really need to practice in this problem
The core of this project is not "making the classifier run" but practicing:
- Why you cannot rely on accuracy alone for imbalanced data
- How to balance recall, precision, and business cost
- How to translate model results into business insights
What should be explained first in this problem
When doing this problem for the first time, what should be explained first is not the model names, but:
- What proportion of customers churn
- Why accuracy alone is not enough
- If the business is more afraid of missing churn customers, which metric should be prioritized
Once these three things are clear, the later model choices and threshold decisions will make sense.
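The first two points take only a few lines to check. A minimal sketch, assuming the DataFrame df with a 0/1 Churn column that Step 1 below builds:
# Assumes `df` with a 0/1 `Churn` column (see Step 1 below)
churn_rate = df['Churn'].mean()
print(f"Churn rate: {churn_rate:.1%}")
print(df['Churn'].value_counts())

# Why accuracy alone misleads: with ~15% churn, a model that always
# predicts "retained" already scores ~85% accuracy while catching nobody.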
A more beginner-friendly analogy
You can think of this problem as:
- Creating a “high-risk list” before customers actually leave
The value of this list is not:
- Judging every single person perfectly
But:
- Missing as few truly churn-prone customers as possible while keeping the false-alarm cost acceptable
That is why, from the very beginning, this problem cannot just focus on accuracy.
Recommended progression
- First build a baseline without handling imbalance
- Then try class weights
- Then try methods like SMOTE
- Finally compare ROC, AUC, F1, and business interpretation
This way, you can tell where the “improvement” really comes from.
The safest default sequence for your first attempt
If this is your first time doing customer churn prediction, it is recommended to follow this order:
- Clarify the business goal first
- Check the class distribution first
- Build the raw baseline first
- Then try class_weight
- Finally try SMOTE
- Then decide whether the threshold should lean more toward recall or precision
This will help you understand whether each improvement comes from:
- The model
- Sampling
- Or the threshold strategy
Step 1: Simulate data
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
# Generate imbalanced customer data
X, y = make_classification(
    n_samples=5000, n_features=15, n_informative=8,
    n_redundant=3, weights=[0.85, 0.15],  # 85% retained, 15% churn
    random_state=42
)
feature_names = ['Monthly Spending', 'Call Duration', 'Data Usage', 'Customer Service Calls', 'Contract Length',
                 'Billing Disputes', 'Plan Tier', 'Household Size', 'Tenure', 'Complaints Last Month',
                 'Data Overages', 'International Roaming', 'Number of Value-added Services', 'Account Balance', 'Device Changes']
df = pd.DataFrame(X, columns=feature_names)
df['Churn'] = y
print(f"Data shape: {df.shape}")
print(f"Churn rate: {df['Churn'].mean():.1%}")
print(f"Churn customers: {df['Churn'].sum()}, Retained customers: {(df['Churn'] == 0).sum()}")
Step 2: Handle imbalanced data

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Method 1: class weights
rf_weighted = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_weighted.fit(X_train, y_train)
y_pred = rf_weighted.predict(X_test)
print("Random Forest with class weights:")
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churn']))
print(f"AUC: {roc_auc_score(y_test, rf_weighted.predict_proba(X_test)[:,1]):.4f}")
Step 2.1 Why you should not jump straight to SMOTE
A more stable order is usually:
- Build the raw baseline first (a sketch follows below)
- Then try class_weight
- Finally try SMOTE
Because only then can you tell the difference between:
- Improvement from the model itself
- Improvement from the sampling strategy
- Improvement from threshold adjustment
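Here is a minimal sketch of that raw baseline, reusing the split and imports from Step 2 above (the name rf_baseline is just for this example):
# Raw baseline: identical setup to Step 2, but no imbalance handling
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)
print("Raw baseline (no imbalance handling):")
print(classification_report(y_test, rf_baseline.predict(X_test), target_names=['Retained', 'Churn']))
print(f"AUC: {roc_auc_score(y_test, rf_baseline.predict_proba(X_test)[:, 1]):.4f}")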
SMOTE oversampling
# python -m pip install --upgrade imbalanced-learn
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline

    # Keep SMOTE inside the pipeline so synthetic samples are created
    # only from training data, never from the test split
    smote_pipe = ImbPipeline([
        ('smote', SMOTE(random_state=42)),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    smote_pipe.fit(X_train, y_train)
    y_pred_smote = smote_pipe.predict(X_test)
    print("\nSMOTE + Random Forest:")
    print(classification_report(y_test, y_pred_smote, target_names=['Retained', 'Churn']))
except ImportError:
    print("Please install imbalanced-learn: python -m pip install --upgrade imbalanced-learn")
Step 3: Feature importance and business insights
# Feature importance
importance = rf_weighted.feature_importances_
sorted_idx = np.argsort(importance)
plt.figure(figsize=(8, 8))
plt.barh(range(len(sorted_idx)), importance[sorted_idx], color='coral')
plt.yticks(range(len(sorted_idx)), np.array(feature_names)[sorted_idx])
plt.xlabel('Feature Importance')
plt.title('Customer Churn Prediction — Feature Importance')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
# Business suggestions
print("\nBusiness insights:")
top3 = np.array(feature_names)[np.argsort(importance)[-3:]]
for i, feat in enumerate(reversed(top3), 1):
    print(f"  {i}. {feat} is among the top drivers of predicted churn")
Step 4: ROC comparison
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(class_weight='balanced', max_iter=1000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
}
plt.figure(figsize=(8, 6))
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    auc = roc_auc_score(y_test, proba)
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', alpha=0.5)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Customer Churn Prediction ROC Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Step 4.1 What is most worth adding here
If you want this project to feel more like a real business project, the most valuable additions are:
- A confusion matrix
- A threshold vs. Precision / Recall / F1 curve
- An explanation of “if recall is the priority, where would I move the threshold”
Because in many real retention projects, the key is not who scores higher under the default threshold of 0.5, but:
- Catching as many high-risk customers as possible while keeping false alarm cost acceptable
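As a starting point for those additions, here is a minimal sketch of a confusion matrix plus a threshold sweep, reusing rf_weighted and the test split from Step 2:
from sklearn.metrics import confusion_matrix, precision_recall_curve

# Confusion matrix under the default 0.5 threshold
proba = rf_weighted.predict_proba(X_test)[:, 1]
print(confusion_matrix(y_test, (proba >= 0.5).astype(int)))

# Precision / recall / F1 as the decision threshold moves
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
plt.figure(figsize=(8, 5))
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.plot(thresholds, f1[:-1], label='F1')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Threshold vs. Precision / Recall / F1')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Reading the curve from right to left shows how far you can lower the threshold to gain recall before precision (the false-alarm cost) drops below what the retention team can absorb.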
What you should include in the final project deliverable
- A class distribution chart
- A confusion matrix
- An ROC curve
- An explanation of “if the goal is to recall as many churn customers as possible, how would I tune the threshold”
A more realistic project review order
You can write the project review in this order:
- Data distribution and business goal
- Baseline model results
- Changes after handling imbalance
- Metric trade-offs and threshold adjustment
- Feature importance and business recommendations
- How to monitor it after deployment
If you keep improving this project, what is most worth adding
The following are usually worth prioritizing:
- A threshold-tuning analysis
- Misclassified customer case analysis
- An explanation of which metrics to prioritize under different business goals
This will turn the project from "a classification task" into something that feels more like a real business decision system.
What is most worth showing in a portfolio
- Class distribution and task goal
- Comparison between baseline and improved versions
- ROC or PR curve
- A threshold explanation chart
- A set of actionable customer retention recommendations
Project checklist
- Analyze the degree of class imbalance
- Try at least 2 imbalance-handling methods (class weights, SMOTE)
- Evaluate with F1 and AUC (not accuracy alone)
- Analyze feature importance and provide business recommendations
- Compare multiple models with ROC curves
Version roadmap suggestion
| Version | Goal | Delivery focus |
|---|---|---|
| Basic version | Complete the minimum working loop | Can input, process, and output, with a sample set retained |
| Standard version | Build a project ready to showcase | Add configuration, logging, error handling, README, and screenshots |
| Challenge version | Approach portfolio quality | Add evaluation, comparison experiments, failed-case analysis, and next-step roadmap |
It is recommended to finish the basic version first rather than chasing an all-inclusive solution from the beginning. Each time you upgrade a version, write in the README what new capability was added, how it was validated, and what problems still remain.