3.6.1 Hands-on Project: Exploratory Data Analysis (EDA)

EDA exploratory data analysis workflow

When doing an EDA project for the first time, the safest sequence is not “draw all the charts first” but the one below:

flowchart LR
A["First get familiar with the data"] --> B["Then clean it"]
B --> C["Then do statistics and grouped analysis"]
C --> D["Then visualize to verify conclusions"]
D --> E["Finally write clear conclusions"]

So what this project really trains is:

  • not whether you can make a few charts
  • but whether you can turn “look at the data -> reach conclusions” into a complete chain

Exploratory Data Analysis (EDA) is the first step in a data science project — before modeling, use statistics and visualization to “get to know” the data.

flowchart LR
A["Get the data"] --> B["Initial understanding"]
B --> C["Data cleaning"]
C --> D["Statistical analysis"]
D --> E["Visual exploration"]
E --> F["Draw conclusions & write report"]
style A fill:#e3f2fd,stroke:#1565c0,color:#333
style F fill:#e8f5e9,stroke:#2e7d32,color:#333

You can think of EDA as:

  • a site survey before actually building a model

You wouldn’t start working before you’ve clearly seen the terrain. Likewise, in a data project, you shouldn’t rush into modeling before you’ve understood:

  • the distribution
  • missing values
  • outliers
  • relationships between variables

| Skill | Corresponding chapter |
| --- | --- |
| Pandas data loading and cleaning | Chapter 3 |
| Statistical summaries and grouped aggregation | Chapter 3 |
| Matplotlib / Seaborn visualization | Chapter 4 |
| NumPy numerical computation | Chapter 2 |

After finishing, you will have a complete EDA report (a Jupyter Notebook) that includes a data overview, cleaning process, statistical findings, and visual charts.


We will use Seaborn’s built-in tips dataset — a record of tips from a U.S. restaurant.

| Field | Meaning | Type |
| --- | --- | --- |
| total_bill | Total bill amount (USD) | Continuous |
| tip | Tip amount (USD) | Continuous |
| sex | Customer gender | Categorical |
| smoker | Smoker or not | Categorical |
| day | Day of the week | Categorical |
| time | Lunch/Dinner | Categorical |
| size | Party size | Discrete |

# Import all required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Configure Chinese font display (macOS)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
# Windows users can use: plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Set Seaborn theme
sns.set_theme(style="whitegrid", font_scale=1.1)
# Show plots inline in Jupyter
# %matplotlib inline
# Load the built-in dataset
tips = sns.load_dataset("tips")
# First look: see what the data looks like
print(f"Dataset size: {tips.shape[0]} rows × {tips.shape[1]} columns")
tips.head(10)

Example output:

The dataset has these columns: total_bill, tip, sex, smoker, day, time, and size.

Two sample rows are enough for a first read:

| Row | What it says |
| --- | --- |
| 0 | A Sunday dinner table of 2, total bill 16.99, tip 1.01, non-smoker |
| 1 | A Sunday dinner table of 3, total bill 10.34, tip 1.66, non-smoker |

Data overview — start with “getting familiar”

The first step in EDA is not to rush into charts, but to first understand: How big is the dataset? What type is each column? Are there missing values?

What should you ask first when looking at data?

The 4 most important questions are:

  1. How large is the table?
  2. What type is each column?
  3. Are there any missing values?
  4. What does the target analysis field look like?

If you can answer these 4 questions clearly first, many later analysis steps will go much more smoothly.
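
The four questions above can be bundled into one small helper. This is only a sketch: `first_look` is a hypothetical name, and `demo` is made-up stand-in data so the block runs without downloading anything. In the notebook you would pass `tips` and pick `tip` as the target field.

```python
import pandas as pd

def first_look(df: pd.DataFrame, target: str) -> dict:
    """Answer the four first-look questions for a DataFrame."""
    return {
        # 1. How large is the table?
        "shape": df.shape,
        # 2. What type is each column?
        "dtypes": df.dtypes.astype(str).to_dict(),
        # 3. Are there any missing values?
        "missing": df.isna().sum().to_dict(),
        # 4. What does the target analysis field look like?
        "target_summary": df[target].describe().round(2).to_dict(),
    }

# Made-up rows (one bill is deliberately missing)
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, None],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "day": ["Sun", "Sun", "Sat", "Sat"],
})

overview = first_look(demo, target="tip")
for question, answer in overview.items():
    print(question, "->", answer)
```

Returning a dictionary rather than printing directly makes it easy to paste the same overview into the report later.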

# Data types and non-null counts
tips.info()

The output tells you:

  • 7 columns, 244 rows
  • No missing values (all Non-Null Count values are 244)
  • total_bill and tip are float64
  • sex, smoker, day, and time are category
# Statistical summary
tips.describe()
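
| | total_bill | tip | size |
| --- | --- | --- | --- |
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.79 | 3.00 | 2.57 |
| std | 8.90 | 1.38 | 0.95 |
| min | 3.07 | 1.00 | 1.00 |
| 25% | 13.35 | 2.00 | 2.00 |
| 50% | 17.80 | 2.90 | 2.00 |
| 75% | 24.13 | 3.56 | 3.00 |
| max | 50.81 | 10.00 | 6.00 |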

Findings:

  • Average bill is about 19.79 USD, and average tip is about 3.00 USD
  • Tips range from 1 USD to 10 USD
  • Most parties have 2 people
# Count each value for categorical variables
for col in ['sex', 'smoker', 'day', 'time']:
    print(f"\n--- {col} ---")
    print(tips[col].value_counts())

Findings:

  • There are more male customers than female customers (157 vs 87)
  • There are more non-smokers than smokers (151 vs 93)
  • Saturday and Sunday have the most records
  • Dinner data is far more common than lunch data (176 vs 68)
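
Raw counts are often easier to compare when they are shown next to shares. A minimal sketch, assuming a made-up `days` Series in place of `tips['day']`; `value_counts(normalize=True)` returns proportions instead of counts.

```python
import pandas as pd

# Made-up values standing in for tips['day']
days = pd.Series(["Sat", "Sat", "Sun", "Thur"], name="day")

counts = days.value_counts()
# normalize=True gives fractions of the total; convert them to percentages
shares = days.value_counts(normalize=True).mul(100).round(1)

print(pd.DataFrame({"count": counts, "share_%": shares}))
```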

Good analysts create new features to help discover patterns:

# Tip percentage = tip / total bill
tips['tip_pct'] = (tips['tip'] / tips['total_bill'] * 100).round(2)
# Per-person spending
tips['per_person'] = (tips['total_bill'] / tips['size']).round(2)
tips[['total_bill', 'tip_pct', 'per_person']].head()

| Row | total_bill | tip_pct | per_person |
| --- | --- | --- | --- |
| 0 | 16.99 | 5.94 | 8.50 |
| 1 | 10.34 | 16.05 | 3.45 |
| 2 | 21.01 | 16.66 | 7.00 |

This dataset is quite clean, but in real projects this step usually takes the most time. We will still go through the full process:

# Missing value statistics
missing = tips.isnull().sum()
print("Missing value statistics:")
print(missing[missing > 0] if missing.sum() > 0 else "No missing values ✓")
# Completely duplicated rows
dup_count = tips.duplicated().sum()
print(f"Duplicate rows: {dup_count}")
if dup_count > 0:
    tips = tips.drop_duplicates()
    print(f"Duplicates removed, {len(tips)} rows remaining")

Use the IQR (interquartile range) method to detect outliers:

def detect_outliers_iqr(df, column):
    """Detect outliers using the IQR method"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    return outliers, lower, upper

# Check outliers in each numeric column
for col in ['total_bill', 'tip', 'tip_pct']:
    outliers, lower, upper = detect_outliers_iqr(tips, col)
    print(f"\n{col}: normal range [{lower:.2f}, {upper:.2f}], {len(outliers)} outliers")
    if len(outliers) > 0:
        print(f" Outlier examples: {outliers[col].values[:5]}")
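
To see what the IQR fences do, it can help to trace the arithmetic on a handful of numbers. The values below are made up for illustration and are not taken from the tips data.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])   # made-up numbers, not from tips

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0
lower = q1 - 1.5 * iqr       # -1.0
upper = q3 + 1.5 * iqr       # 7.0

# Anything outside [lower, upper] is flagged
flagged = values[(values < lower) | (values > upper)].tolist()
print(f"fences: [{lower}, {upper}], flagged: {flagged}")
```

A flagged value is only a candidate for review. In the tips data, an unusually large bill may simply be a large party, so check the row before dropping it.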

Statistical analysis — let the numbers speak

# Tip statistics grouped by gender
tips.groupby('sex')[['total_bill', 'tip', 'tip_pct']].agg(['mean', 'median', 'std'])
# Group by day
day_stats = tips.groupby('day')[['total_bill', 'tip']].agg(['mean', 'count'])
print(day_stats)
# Pivot table: average tip percentage by gender and smoker status
pivot = tips.pivot_table(
    values='tip_pct',
    index='sex',
    columns='smoker',
    aggfunc='mean'
).round(2)
print("Tip percentage (%):")
print(pivot)

Example output:

| sex / smoker | No | Yes |
| --- | --- | --- |
| Female | 15.69 | 18.22 |
| Male | 16.07 | 15.28 |

Finding: Female smokers have the highest tip percentage, while male smokers have the lowest.
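
A cell mean in a pivot table can rest on very few rows, so it is worth printing the group sizes next to the means before trusting a finding like this one. A minimal sketch with made-up rows in `demo`; in the notebook you would pass `tips` instead.

```python
import pandas as pd

# Made-up stand-in rows; replace `demo` with `tips` in the notebook
demo = pd.DataFrame({
    "sex":     ["Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "smoker":  ["Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "tip_pct": [30.0, 15.0, 17.0, 14.0, 16.0, 15.0, 17.0],
})

summary = demo.pivot_table(
    values="tip_pct",
    index="sex",
    columns="smoker",
    aggfunc=["mean", "count"],   # the mean plus how many rows support it
)
print(summary)
```

In this toy data the highest mean (female smokers) comes from a single row, which is exactly the situation the count column is meant to expose.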

# Correlation coefficients for numeric columns
numeric_cols = ['total_bill', 'tip', 'size', 'tip_pct', 'per_person']
corr_matrix = tips[numeric_cols].corr().round(3)
print(corr_matrix)

Key findings:

  • total_bill and tip are positively correlated (about 0.68) → the more you spend, the more tip you leave
  • total_bill and tip_pct are negatively correlated (about -0.09) → as spending increases, the tip percentage slightly decreases
  • size and total_bill are positively correlated (about 0.60) → the larger the party, the higher the spending
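
The coefficients above are Pearson correlations, which `DataFrame.corr()` computes by default. The sketch below recomputes one by hand from the definition (covariance divided by the product of the standard deviations) and compares it with pandas. The `bill` and `tip` numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up numbers standing in for total_bill and tip
bill = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
tip = pd.Series([2.0, 3.0, 5.0, 4.0, 8.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = bill - bill.mean()
dy = tip - tip.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = bill.corr(tip)   # pandas uses Pearson by default
print(round(r_manual, 3), round(r_pandas, 3))
```

Pearson r only measures linear association, so a value near zero does not rule out a curved relationship, and a high value says nothing about cause.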

A beginner-friendly analysis order to remember

When doing EDA, a safer order is usually:

  1. First look at single-variable distributions
  2. Then look at counts of categorical variables
  3. Then look at relationships between two variables
  4. Finally do combined analysis and multi-dimensional comparisons

This is often easier to follow than jumping directly into complex facet plots at the beginning.


fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Total bill distribution
axes[0].hist(tips['total_bill'], bins=20, color='steelblue', edgecolor='white')
axes[0].set_title('Total Bill Distribution')
axes[0].set_xlabel('Amount (USD)')
axes[0].set_ylabel('Frequency')
# Tip distribution
axes[1].hist(tips['tip'], bins=20, color='coral', edgecolor='white')
axes[1].set_title('Tip Distribution')
axes[1].set_xlabel('Amount (USD)')
# Tip percentage distribution
axes[2].hist(tips['tip_pct'], bins=20, color='mediumseagreen', edgecolor='white')
axes[2].set_title('Tip Percentage (%) Distribution')
axes[2].set_xlabel('Percentage')
plt.tight_layout()
plt.savefig('01_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

Interpretation: Total bill and tip both have right-skewed distributions — most people spend between 10 and 25 USD, and most tips are between 2 and 4 USD.
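
A histogram reading such as “right-skewed” can be backed by two quick numbers: the sample skewness and whether the mean sits above the median. A minimal sketch on made-up amounts; in the notebook the same calls apply to `tips['total_bill']` and `tips['tip']`.

```python
import pandas as pd

# Made-up amounts with a long right tail, standing in for tips['total_bill']
amounts = pd.Series([8, 10, 11, 12, 13, 14, 15, 18, 25, 48])

skewness = amounts.skew()                               # > 0 suggests a right tail
mean_above_median = amounts.mean() > amounts.median()   # the tail pulls the mean up

print(f"skew={skewness:.2f}, mean={amounts.mean():.2f}, median={amounts.median():.2f}")
```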

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Count by day
sns.countplot(data=tips, x='day', order=['Thur', 'Fri', 'Sat', 'Sun'],
              palette='Blues_d', ax=axes[0, 0])
axes[0, 0].set_title('Customer Count by Day')
# By time of day
sns.countplot(data=tips, x='time', palette='Set2', ax=axes[0, 1])
axes[0, 1].set_title('Lunch vs Dinner')
# By gender
sns.countplot(data=tips, x='sex', palette='Pastel1', ax=axes[1, 0])
axes[1, 0].set_title('Customer Gender Distribution')
# By smoking status
sns.countplot(data=tips, x='smoker', palette='Pastel2', ax=axes[1, 1])
axes[1, 1].set_title('Smoker vs Non-smoker')
plt.tight_layout()
plt.savefig('02_categorical.png', dpi=150, bbox_inches='tight')
plt.show()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Scatter plot: bill vs tip
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time',
                style='smoker', s=80, alpha=0.7, ax=axes[0])
axes[0].set_title('Total Bill vs Tip')
axes[0].set_xlabel('Total Bill (USD)')
axes[0].set_ylabel('Tip (USD)')
# Regression line
sns.regplot(data=tips, x='total_bill', y='tip',
            scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'},
            ax=axes[1])
axes[1].set_title('Total Bill vs Tip (with trend line)')
axes[1].set_xlabel('Total Bill (USD)')
axes[1].set_ylabel('Tip (USD)')
plt.tight_layout()
plt.savefig('03_bill_vs_tip.png', dpi=150, bbox_inches='tight')
plt.show()

Interpretation: As the bill amount increases, the tip also increases, showing a clear linear trend. You can also see some “outliers” — for example, someone spent more than 40 USD but only left a 1.5 USD tip.
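
Points that sit far from the trend line can also be located numerically instead of by eye. The sketch below fits a straight line with `np.polyfit` and ranks rows by residual. The rows in `demo` are made up, with one large bill and small tip planted on purpose.

```python
import numpy as np
import pandas as pd

# Made-up rows: tips of 15% everywhere except the 42 USD bill
demo = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 50.0, 42.0],
    "tip":        [1.5, 2.25, 3.0, 3.75, 4.5, 5.25, 6.0, 7.5, 1.5],
})

# Fit tip = slope * total_bill + intercept (ordinary least squares)
slope, intercept = np.polyfit(demo["total_bill"], demo["tip"], deg=1)
demo["residual"] = demo["tip"] - (slope * demo["total_bill"] + intercept)

# The row with the largest absolute residual is furthest from the trend
worst = demo.loc[demo["residual"].abs().idxmax()]
print(worst)
```

A negative residual means the tip was lower than the fitted line predicts for that bill. Keep in mind that an extreme point also pulls the fitted line toward itself, so residuals understate how unusual it is.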

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Compare tip by day
sns.boxplot(data=tips, x='day', y='tip',
            order=['Thur', 'Fri', 'Sat', 'Sun'],
            palette='coolwarm', ax=axes[0])
axes[0].set_title('Tip Distribution by Day')
# Compare by time of day
sns.violinplot(data=tips, x='time', y='tip',
               palette='Set2', ax=axes[1])
axes[1].set_title('Tip Distribution: Lunch vs Dinner')
# Compare by party size
sns.boxplot(data=tips, x='size', y='tip',
            palette='YlOrRd', ax=axes[2])
axes[2].set_title('Tip by Party Size')
plt.tight_layout()
plt.savefig('04_tip_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Interpretation:

  • Sunday has the highest median tip
  • Dinner tips are overall higher than lunch tips (because dinner spending is higher)
  • Larger parties give higher tips
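
The explanation in parentheses (dinner tips are higher because dinner bills are higher) can be checked by comparing tip amount and tip percentage side by side. A minimal sketch on made-up rows in which every table tips the same percentage; in the notebook, run the same `groupby` on `tips`.

```python
import pandas as pd

# Made-up rows: every table tips exactly 16% of the bill
demo = pd.DataFrame({
    "time":       ["Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip":        [1.6, 3.2, 4.8, 6.4],
})
demo["tip_pct"] = demo["tip"] / demo["total_bill"] * 100

by_time = demo.groupby("time")[["total_bill", "tip", "tip_pct"]].mean()
print(by_time.round(2))
```

If the tip amount differs between groups while the percentage stays flat, as in this toy data, the gap is explained by bill size rather than by generosity.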

plt.figure(figsize=(8, 6))
# Draw heatmap
sns.heatmap(
    corr_matrix,
    annot=True,       # show values
    fmt='.2f',        # keep two decimal places
    cmap='RdBu_r',    # red-blue palette
    center=0,         # center at 0
    square=True,      # square cells
    linewidths=0.5    # grid line width
)
plt.title('Correlation Matrix of Numeric Variables')
plt.tight_layout()
plt.savefig('05_correlation.png', dpi=150, bbox_inches='tight')
plt.show()

# FacetGrid: look at the bill-tip relationship by gender and smoking status
g = sns.FacetGrid(tips, col='sex', row='smoker',
                  height=4, aspect=1.2, margin_titles=True)
g.map_dataframe(sns.scatterplot, x='total_bill', y='tip',
                hue='time', alpha=0.7)
g.add_legend()
g.set_axis_labels('Total Bill (USD)', 'Tip (USD)')
g.fig.suptitle('Faceted by Gender × Smoking Status', y=1.02, fontsize=14)
plt.savefig('06_facet.png', dpi=150, bbox_inches='tight')
plt.show()

After a complete EDA, we can draw the following conclusions:

mindmap
  root((EDA Key Findings))
    Spending patterns
      Most spending is between 10-25 USD
      Dinner spending is higher than lunch
      Weekend has the most customers
    Tip patterns
      Average tip is about 15-16%
      Higher spending means higher tip amount
      But tip percentage slightly decreases
    Group differences
      Male spending is slightly higher than female
      Difference between smokers and non-smokers is small
      Party size is a key factor

  1. Bill and tip are positively correlated: the higher the total bill, the higher the tip amount (correlation coefficient 0.68), but the tip percentage decreases slightly
  2. Dinner spending is higher than lunch: both average spending and average tip are noticeably higher at dinner
  3. Weekends are peak periods: Saturday and Sunday have the most customers and the highest spending
  4. Party size matters a lot: the larger the party, the higher the total bill (correlation coefficient 0.60)
  5. Gender differences are small: men and women do not differ much in tip percentage (about 1 percentage point)
  6. Smoking status has limited impact: overall, whether someone smokes makes little difference to tip percentage

Recommendations that follow from these findings:

  • Weekend dinner is the key revenue period, so service quality should be ensured
  • Encourage larger parties to dine in (more people usually means more spending and more tips)
  • Consider lunch set meals to increase midday traffic

A beginner-friendly way to write conclusions

Good EDA conclusions are usually not:

  • I drew a lot of charts

Instead, they should answer:

  1. What did I find?
  2. Which charts and statistics support this?
  3. What does this mean for the business?

This order is especially important because it turns your Notebook from “many charts” into “a report with real insights.”


Code integration — complete analysis script

Combine the above analysis into a clear, structured script:

"""
Tips dataset - Exploratory Data Analysis (EDA)
==============================================
Analysis goal: Understand the factors that influence restaurant spending and tipping
"""
# ========== 1. Imports and configuration ==========
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
sns.set_theme(style="whitegrid", font_scale=1.1)
# ========== 2. Load data ==========
tips = sns.load_dataset("tips")
print(f"Dataset: {tips.shape[0]} rows × {tips.shape[1]} columns\n")
# ========== 3. Data overview ==========
print("=== Basic information ===")
tips.info()
print("\n=== Statistical summary ===")
print(tips.describe().round(2))
# ========== 4. Feature engineering ==========
tips['tip_pct'] = (tips['tip'] / tips['total_bill'] * 100).round(2)
tips['per_person'] = (tips['total_bill'] / tips['size']).round(2)
# ========== 5. Data quality check ==========
print(f"\nMissing values: {tips.isnull().sum().sum()}")
print(f"Duplicate rows: {tips.duplicated().sum()}")
# ========== 6. Statistical analysis ==========
print("\n=== Grouped by gender ===")
print(tips.groupby('sex')[['total_bill', 'tip', 'tip_pct']].mean().round(2))
print("\n=== Grouped by day ===")
print(tips.groupby('day')[['total_bill', 'tip']].agg(['mean', 'count']).round(2))
print("\n=== Correlation matrix ===")
print(tips[['total_bill', 'tip', 'size', 'tip_pct']].corr().round(3))
# ========== 7. Visualization ==========
# See the visualization code in the sections above
# Running each part step by step in Jupyter Notebook works best
print("\nAnalysis complete!")

After completing the basic EDA, try these challenges:

Do EDA with Seaborn’s built-in diamonds dataset:

diamonds = sns.load_dataset("diamonds")
print(diamonds.shape) # 53940 rows × 10 columns
print(diamonds.head())

Analysis directions:

  • Which factors affect diamond price?
  • How do cut, color, and clarity affect price?
  • Is the relationship between carat and price linear?
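
One way to approach the linearity question is to compare the correlation on the original scale with the correlation after taking logs of both variables. The sketch below uses made-up power-law data, so the exponent 1.7 and the factor 4000 are assumptions for illustration, not properties of the diamonds dataset.

```python
import numpy as np
import pandas as pd

# Made-up power-law data standing in for diamonds: price = 4000 * carat ** 1.7
carat = pd.Series([0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0])
price = 4000 * carat ** 1.7

r_raw = carat.corr(price)                    # straight-line fit on the original scale
r_log = np.log(carat).corr(np.log(price))    # straight-line fit on the log-log scale

print(f"raw r = {r_raw:.3f}, log-log r = {r_log:.3f}")
```

If the log-log correlation is clearly higher on the real data too, a power-law relationship describes price better than a straight line, and plotting with log axes is a reasonable next step.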

Try using code to automatically generate a simple report:

def quick_eda(df, title="EDA Report"):
    """Quickly generate an EDA report"""
    print(f"{'='*50}")
    print(f" {title}")
    print(f"{'='*50}")
    # Basic information
    print(f"\n📊 Dataset size: {df.shape[0]} rows × {df.shape[1]} columns")
    # Data type statistics
    print(f"\n📋 Data types:")
    print(df.dtypes.value_counts().to_string())
    # Missing values
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print(f"\n⚠️ Missing values:")
        print(missing[missing > 0].to_string())
    else:
        print(f"\n✅ No missing values")
    # Numeric column statistics
    num_cols = df.select_dtypes(include=[np.number]).columns
    if len(num_cols) > 0:
        print(f"\n📈 Numeric column statistics:")
        print(df[num_cols].describe().round(2).to_string())
    # Categorical column statistics
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in cat_cols:
        print(f"\n🏷️ Distribution of {col}:")
        print(df[col].value_counts().head(5).to_string())
    return None

# Use it
quick_eda(tips, "Tips Dataset EDA")

Challenge 3: Make an interactive version with Plotly

If you learned Plotly in Chapter 4, try replacing static charts with interactive ones:

import plotly.express as px
# Interactive scatter plot
fig = px.scatter(
    tips, x='total_bill', y='tip',
    color='time', size='size',
    hover_data=['sex', 'smoker', 'day'],
    title='Total Bill vs Tip (Interactive)'
)
fig.show()

After finishing the project, check the following:

| Check item | Completed |
| --- | --- |
| Load the data and view the first few rows | |
| Check info() and describe() | |
| Check missing values and duplicate values | |
| Detect outliers | |
| Create meaningful derived features | |
| Plot distributions of numeric variables | |
| Plot count charts for categorical variables | |
| Explore relationships between variables (scatter plots, box plots) | |
| Plot a correlation heatmap | |
| Multi-dimensional cross analysis (facet plots, pivot tables) | |
| Write 3–5 valuable findings | |
| Provide data-driven recommendations | |

A ready-to-use EDA checklist for beginners

When doing an EDA project for the first time, the safest checklist is usually:

  1. Is the data overview clear?
  2. Have missing values and outliers been explained?
  3. Have single-variable, two-variable, and grouped analyses each been done at least once?
  4. Does each key chart have a clear conclusion?
  5. Have the findings been translated into business recommendations?

If you can do these 5 things well, this project is no longer just a “plotting exercise,” but a real analysis report.


Project reference and review notes
  • There is no single numeric answer for an EDA project. A strong submission includes raw data location, data dictionary, cleaning log, summary statistics, at least three question-driven visuals, conclusions, and limitations.
  • Every visual should answer a named question and point back to the cleaned dataset. If a chart cannot be tied to a question, remove it or rewrite the question.
  • The final README should let another person reproduce the analysis and understand which decisions were judgment calls.
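
A cleaning log of the kind mentioned above can be as simple as recording row counts before and after each step. This is a sketch only: `log_step` and `cleaning_log` are hypothetical names, and the rows in `raw` are made up.

```python
import pandas as pd

cleaning_log = []   # one entry per cleaning step

def log_step(df_before, df_after, step, reason):
    """Record how many rows a cleaning step kept, and why it was applied."""
    cleaning_log.append({
        "step": step,
        "rows_before": len(df_before),
        "rows_after": len(df_after),
        "reason": reason,
    })
    return df_after

# Made-up rows: one exact duplicate and one missing bill
raw = pd.DataFrame({
    "total_bill": [16.99, 16.99, 10.34, None],
    "tip": [1.01, 1.01, 1.66, 2.00],
})

deduped = log_step(raw, raw.drop_duplicates(), "drop_duplicates", "exact duplicate rows")
clean = log_step(deduped, deduped.dropna(), "dropna", "total_bill missing")

print(pd.DataFrame(cleaning_log).to_string(index=False))
```

The printed table can go straight into the README, with the `reason` column marking which steps were judgment calls.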

| Version | Goal | Delivery focus |
| --- | --- | --- |
| Basic version | Get the minimal loop working | Can input, process, and output, while keeping one set of examples |
| Standard version | Build a presentable project | Add configuration, logs, error handling, README, and screenshots |
| Challenge version | Get close to portfolio quality | Add evaluation, comparison experiments, failure sample analysis, and next-step roadmap |

Complete the basic version first rather than aiming for everything at once. Each time you move up a version, record in the README what new capability was added, how it was validated, and what problems remain.

Keep this page’s proof of learning as a small evidence card:

  • Analysis Goal: business/data question and success criterion
  • Data Evidence: source, cleaning notes, features, and chart/table outputs
  • Result: insight, metric, dashboard, or report section
  • Failure Check: dirty data, biased sample, wrong aggregation, or unreproducible notebook
  • Expected Output: reproducible analysis folder with data, charts, and a short report