5.6.5 Kaggle Competition Practice (Elective)

Kaggle competition submission loop diagram

Section focus

Kaggle is the world’s largest data science competition platform. By joining a beginner competition, you can connect all the skills you learned earlier and verify them against a real scoring system.

Learning Objectives

  • Understand the Kaggle platform and competition workflow
  • Learn how to participate in a beginner-level competition (Titanic)
  • Learn techniques from high-quality Notebooks

Key Terms Before You Read the Code

  • Notebook means an online or local interactive document, usually Jupyter-style, where text, code, output, and charts are written together.
  • Leaderboard is Kaggle’s public ranking table. It is useful feedback, but overusing it can make you overfit the public test set.
  • Submission file is the CSV file you upload for scoring. Its column names and row order must match the competition rules exactly.
  • Public score vs. local validation: the public score comes from Kaggle’s hidden labels, while local validation comes from your own split. If they disagree, your validation setup may be unreliable.
  • Baseline is the first simple model used as a comparison point. Without a baseline, you cannot tell whether later changes truly improved the project.
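A format mistake in the submission file is one of the most common reasons a first upload fails to score. As a sanity check before uploading, you can verify the column names, row count, and ID uniqueness with pandas. A minimal sketch (the `check_submission` helper is illustrative; the `PassengerId`/`Survived` columns and the 418-row test set are Titanic-specific):

```python
import pandas as pd

def check_submission(df: pd.DataFrame,
                     expected_columns=("PassengerId", "Survived"),
                     expected_rows=418) -> list:
    """Return a list of format problems; an empty list means the file looks valid."""
    problems = []
    if list(df.columns) != list(expected_columns):
        problems.append(f"columns are {list(df.columns)}, expected {list(expected_columns)}")
    if len(df) != expected_rows:
        problems.append(f"{len(df)} rows, expected {expected_rows}")
    elif df[expected_columns[0]].duplicated().any():
        problems.append("duplicate IDs found")
    return problems

# Example: a tiny fake submission with the wrong row count
sub = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})
print(check_submission(sub))
```

Running the same check on your real `submission.csv` before every upload costs seconds and saves a wasted daily submission slot.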

First, Build a Map

The easiest way for beginners to go wrong on Kaggle is to focus only on the leaderboard and not know what they are actually practicing.

A better way to frame it: the most important thing in this section is not "how high a score can you get," but how to put the entire ML workflow you learned earlier into a real evaluation environment.

What You Really Need to Practice Here

What is most valuable for beginners on Kaggle is not really “what rank did I get,” but:

  • Completing a full project for the first time with real data and real evaluation rules
  • Learning to turn baseline, cross-validation, feature engineering, and submission records into a closed loop
  • Learning to tell whether “good local validation” and “good leaderboard score” are actually the same thing

Kaggle Platform Basics

Core Features

| Feature | Description |
| --- | --- |
| Competitions | Beginner, standard, and prize-level competitions |
| Datasets | Massive free datasets |
| Notebooks | Online Jupyter environment (free GPU) |
| Discussion | Forum (learn from others' ideas) |
| Learn | Official free courses |

Competition Workflow


Beginner Competition: Titanic Survival Prediction

Full Solution Workflow

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# 1. Load data (download from Kaggle or use seaborn)
import seaborn as sns
df = sns.load_dataset('titanic').dropna(subset=['embarked'])

# 2. Feature engineering
df['family_size'] = df['sibsp'] + df['parch'] + 1
df['is_alone'] = (df['family_size'] == 1).astype(int)

# 3. Define features
num_features = ['age', 'fare', 'family_size']
cat_features = ['sex', 'embarked', 'class']
all_features = num_features + cat_features

X = df[all_features]
y = df['survived']

# 4. Build Pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), num_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(drop='first', sparse_output=False)),
    ]), cat_features),
])

# 5. Model comparison
models = {
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'GBDT': GradientBoostingClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")

What Is the Safest Goal for Your First Kaggle Competition?

When you enter a Kaggle competition for the first time, do not set your goal as "climbing the rankings." A safer set of goals is:

  1. Submit your first valid result
  2. Build a clear baseline
  3. Complete at least two rounds of documented improvements
  4. Be able to explain why each change improved the score or why it did not

If you can do these four things, you have already learned the most important parts.
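Goal 2, "build a clear baseline," can be as simple as a majority-class predictor: any later change should at least beat it. A minimal sketch with scikit-learn's `DummyClassifier` (it uses a tiny synthetic stand-in for the Titanic data so the snippet runs on its own):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Tiny synthetic stand-in for the Titanic features/labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Majority-class baseline: predicts the most frequent label everywhere
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()
print(f"Baseline accuracy: {baseline_acc:.3f}")
```

On Titanic, swap in the real `X`, `y` from the solution workflow above; the dummy score tells you what "doing nothing clever" already achieves.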

Generate a Submission File

# Standard submission format in Kaggle competitions
# Assume test_df is the competition's test set (test.csv downloaded from Kaggle)
# pipe.fit(X_train, y_train)
# predictions = pipe.predict(test_df[all_features])
#
# submission = pd.DataFrame({
#     'PassengerId': test_df['PassengerId'],
#     'Survived': predictions,
# })
# submission.to_csv('submission.csv', index=False)
# print(f"Submission file shape: {submission.shape}")

Techniques for Improving Competition Scores

Score Improvement Path

| Stage | Focus | Expected Improvement |
| --- | --- | --- |
| Baseline | Simple model + default parameters | — (reference point) |
| Feature engineering | Create new features, optimize encoding | Significant |
| Model selection | Try multiple models | Moderate |
| Hyperparameter tuning | GridSearch / Optuna | Small |
| Model ensembling | Stacking / Blending | Small but stable |
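The hyperparameter tuning stage usually means a systematic search over a small grid rather than hand-tweaking. A minimal `GridSearchCV` sketch on a random forest (the synthetic dataset and the parameter grid are illustrative choices, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the snippet runs on its own
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Small, cheap grid: 2 x 2 = 4 candidates, each scored with 3-fold CV
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Note that the grid here has only four candidates on purpose: as the table suggests, tuning usually pays less than feature engineering, so keep the search cheap.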

Common Pitfalls for Beginners on Kaggle

Kaggle validation and leaderboard loop

  • Repeatedly trying things on the public leaderboard and overfitting to it
  • Having no local cross-validation and only watching the online score
  • Changing too many things at once, so you cannot tell what caused the improvement
  • Copying a high-scoring Notebook directly, without being able to explain what you actually learned

So a safer approach is:

  • First make your local validation process solid
  • Change only one major factor at a time
  • Turn every submission into an experiment record
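"Turn every submission into an experiment record" can be as lightweight as appending one row to a CSV per upload. A minimal sketch (the file name `experiments.csv` and the fields are illustrative; the logged scores are made-up examples):

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("experiments.csv")
FIELDS = ["date", "change", "local_cv", "public_score"]

def log_submission(change: str, local_cv: float, public_score: float) -> None:
    """Append one experiment to the log, writing the header on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"date": date.today().isoformat(), "change": change,
                         "local_cv": local_cv, "public_score": public_score})

# Example entries: one change per submission, so the cause of any score
# movement stays identifiable
log_submission("baseline: RF with default features", 0.805, 0.765)
log_submission("added family_size feature", 0.812, 0.772)
```

Because each row records exactly one change, the log also answers the earlier question of whether "good local validation" and "good leaderboard score" move together.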

Learn from High-Quality Notebooks

| What to Look At | Why |
| --- | --- |
| Most upvoted Notebooks | Community-approved ideas |
| EDA Notebooks | Learn data exploration techniques |
| High-ranking competitors' shared notebooks | Learn feature engineering and ensembling strategies |
| Discussion section | Understand data leakage, scoring pitfalls, and more |

Recommended Beginner Competitions

| Competition | Type | Difficulty | Description |
| --- | --- | --- | --- |
| Titanic | Classification | Beginner | Classic starter competition with rich community resources |
| House Prices | Regression | Beginner | House price prediction, good for feature engineering practice |
| Digit Recognizer | Image classification | Beginner | MNIST; can try a simple CNN |
| Spaceship Titanic | Classification | Beginner | An upgraded version of Titanic |

The Safest Way for Beginners to Participate in Kaggle

  1. Choose only beginner-level tasks
  2. Build a baseline first, and do not chase complex ensembling
  3. Change only one thing at a time
  4. Record what changed in each submission and why the score changed

This way, you learn a method, not just how to copy a high-scoring Notebook.

If You Treat Kaggle as a Course Training Ground, How Should You Use It?

A very recommended way is:

  1. Use Kaggle to find a real-world task
  2. Use the methods from the course to build a baseline
  3. Iterate using the evaluation and feature engineering ideas from the course
  4. Finally, organize the results into your own project review

In this way, Kaggle will not lead you into “just chasing rankings,” but instead become the best practical amplifier for the fifth stage.


Suggested Version Roadmap

| Version | Goal | Delivery Focus |
| --- | --- | --- |
| Basic version | Get the minimum loop working | Can input, process, and output, while keeping one set of examples |
| Standard version | Form a presentable project | Add configuration, logs, error handling, README, and screenshots |
| Challenge version | Close to portfolio quality | Add evaluation, comparison experiments, failure-sample analysis, and a next-step roadmap |

It is recommended to finish the basic version first. Do not try to make everything complete from the start. Each time you level up, write into the README what new capabilities were added, how they were verified, and what issues remain.

Summary

| Key Point | Description |
| --- | --- |
| Start with beginner competitions | Titanic / House Prices |
| Build a baseline before optimizing | Do not jump into complex models right away |
| Study excellent Notebooks | Learn from the best |
| Feature engineering matters most | It gives much more return than tuning |
| Keep submitting and iterating | Submit every improvement and check the effect |

Hands-on Challenges

Challenge 1: Titanic Sprint to 0.80+

Register an account on Kaggle, join the Titanic competition, and use all the skills learned in this course (feature engineering + Pipeline + model tuning) to try to reach a score of 0.80+.

Challenge 2: House Prices Practice

Join Kaggle’s House Prices competition, use a larger dataset to practice regression tasks, and focus on missing value handling and high-dimensional categorical feature encoding.