
3 Data Analysis and Visualization

Figure: main visual for data analysis and visualization

Chapter 3 has one job: help you turn messy data into a trustworthy conclusion with reproducible code and charts.

See The Data Analysis Loop

Figure: main workflow loop of data analysis

Read the picture first. Most useful analysis follows this loop:

read -> inspect -> clean -> summarize -> visualize -> explain

Do not start by drawing charts. First understand each field, its unit, the missing values, the duplicates, and where the samples came from.
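The inspect step can be sketched in a few lines. This is a minimal example, assuming pandas is already installed; the sample rows are made up for illustration, and the variable names are not part of the chapter's own code.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample data, made up for illustration.
raw = StringIO("""topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
""")
df = pd.read_csv(raw)

n_rows, n_cols = df.shape                   # how many rows and columns
column_types = df.dtypes                    # dtype of each column
missing_per_column = df.isna().sum()        # missing values per column
n_duplicates = int(df.duplicated().sum())   # fully repeated rows

print("shape:", n_rows, n_cols)
print(column_types)
print(missing_per_column)
print("duplicate rows:", n_duplicates)
```

None of these calls change the data. They only tell you what you will have to decide in the clean step.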

Learning Order And Task List

Use this table as both the chapter guide and the task sheet.

| Page | Follow-along action | Evidence to keep |
| --- | --- | --- |
| 3.1.1 Pure Python Data Processing | Process a small table with lists and dictionaries | A note explaining why tables become awkward in pure Python |
| 3.2.1 NumPy Overview to 3.2.7 Random and Statistics | Practice arrays, shapes, slicing, broadcasting, and vectorized math | One NumPy practice file |
| 3.3.1 Pandas Core Structures to 3.3.8 Time Series | Read a table, clean missing values, group rows, merge tables, and export results | Cleaned data plus a cleaning log |
| 3.4.1 Matplotlib to 3.4.4 Visualization Best Practices | Draw charts that answer named questions | 3 charts, each with one conclusion |
| 3.5.1 Relational Databases to 3.5.4 Database Design | Learn enough SQL to filter, group, and join real application data | One query or join example |
| 3.6.1 EDA Project and 3.6.3 Follow-Along Workshop | Build a reproducible data pipeline and report | Raw data, clean data, chart, report, and README |
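As a taste of the first task in the table, here is a minimal sketch of grouping and ranking a small table with only lists and dictionaries. The rows are hypothetical, and this is one possible approach rather than the code from page 3.1.1.

```python
# Hypothetical rows, made up for illustration.
rows = [
    {"topic": "Python", "minutes": 45},
    {"topic": "Pandas", "minutes": 30},
    {"topic": "Python", "minutes": 20},
]

# Group by topic and sum minutes by hand.
totals = {}
for row in rows:
    totals[row["topic"]] = totals.get(row["topic"], 0) + row["minutes"]

# Rank topics from most to fewest minutes.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Every extra need (missing values, a second grouping key, a join) means another hand-written loop, which is the awkwardness the note for 3.1.1 asks you to describe.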

Key terms for this chapter:

| Term | Meaning |
| --- | --- |
| CSV | A plain-text table where each row is a record |
| DataFrame | A Pandas table with rows, columns, names, and indexes |
| Series | One column from a DataFrame |
| dtype | The data type of a column or array |
| EDA | Exploratory Data Analysis: first-pass exploration before modeling |
| groupby | Split by category, calculate statistics, then combine |
| merge / join | Combine tables by shared keys |
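Several of these terms can be seen together in one short sketch. It assumes pandas is installed; the two tables and their column names (`sessions`, `levels`, `level`) are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Two hypothetical DataFrames that share the key column "topic".
sessions = pd.DataFrame(
    {"topic": ["Python", "Pandas", "Python"], "minutes": [45, 30, 20]}
)
levels = pd.DataFrame(
    {"topic": ["Python", "Pandas"], "level": ["core", "library"]}
)

minutes = sessions["minutes"]    # a Series: one column of the DataFrame
minutes_dtype = minutes.dtype    # the dtype of that column

# groupby: split by topic, sum minutes, combine into one table.
totals = sessions.groupby("topic")["minutes"].sum().reset_index()

# merge: attach the level of each topic using the shared key.
merged = totals.merge(levels, on="topic", how="left")
print(merged)
```

A left merge keeps every row of `totals` even when `levels` has no matching key, which makes unmatched categories easier to spot.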

First Runnable Loop

Install the two packages once:

python -m pip install pandas matplotlib

Then run this script in an empty practice folder. It creates dirty data, cleans it, summarizes it, and saves one chart.

from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

raw = StringIO("""topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
""")

df = pd.read_csv(raw)
print("Before cleaning")
print(df)

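# Drop exact duplicate rows. The .copy() makes clean_df an independent table,
# so the column assignment below does not trigger a SettingWithCopyWarning.
clean_df = df.drop_duplicates().copy()
# Fill the missing value with the median, then drop rows above 180 minutes.
clean_df["minutes"] = clean_df["minutes"].fillna(clean_df["minutes"].median())
clean_df = clean_df[clean_df["minutes"] <= 180]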

summary = clean_df.groupby("topic")["minutes"].sum().sort_values(ascending=False)
print("\nAfter cleaning")
print(summary)

summary.plot(kind="bar", title="Study minutes by topic")
plt.ylabel("minutes")
plt.tight_layout()
plt.savefig("topic_minutes.png")
print("\nSaved chart: topic_minutes.png")

Expected shape:

Before cleaning
...
After cleaning
topic
Python 45.0
Visualization ...
Saved chart: topic_minutes.png

The pass line is not “the chart looks nice.” The pass line is: you can explain which rows changed, why they changed, and how that affects the conclusion.
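One way to make "which rows changed" easy to explain is to count the rows each rule touches as you go. The sketch below reuses the same sample data and the same three rules as the script. The `log` list and its wording are an assumption for illustration, not a required format.

```python
from io import StringIO
import pandas as pd

raw = StringIO("""topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
""")
df = pd.read_csv(raw)

log = []  # one (rule, rows affected) entry per cleaning rule

# Rule 1: drop exact duplicate rows.
deduped = df.drop_duplicates().copy()
log.append(("drop exact duplicates", len(df) - len(deduped)))

# Rule 2: fill missing minutes with the median of the remaining values.
n_missing = int(deduped["minutes"].isna().sum())
fill_value = deduped["minutes"].median()
deduped["minutes"] = deduped["minutes"].fillna(fill_value)
log.append((f"fill missing minutes with median {fill_value}", n_missing))

# Rule 3: drop rows above 180 minutes.
kept = deduped[deduped["minutes"] <= 180]
log.append(("drop rows above 180 minutes", len(deduped) - len(kept)))

for rule, n_affected in log:
    print(f"{rule}: {n_affected} row(s) affected")
```

Saving these lines next to the cleaned data gives you the cleaning log that the task sheet asks for.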

Depth Ladder

| Level | What you can prove |
| --- | --- |
| Minimum pass | You can read a table, inspect shape/types/missing values, clean obvious problems, and save one chart. |
| Project-ready | Your report names the question, cleaning rules, summary table, chart, conclusion, limitation, and rerun command. |
| Deeper check | You can test whether the conclusion changes under another cleaning rule, spot leakage or sampling bias, and explain why a chart type fits the question. |
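A deeper check can be as small as rerunning the summary under a second cleaning rule. The sketch below reuses the sample data from the first runnable loop and compares two orders of the same steps: fill missing values before removing the 300-minute outlier, or after. The helper `summarize` and its `fill_first` flag are hypothetical names introduced for this illustration.

```python
from io import StringIO
import pandas as pd

RAW = """topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
"""

def summarize(fill_first: bool) -> pd.Series:
    """Return minutes per topic under one of two cleaning orders."""
    df = pd.read_csv(StringIO(RAW)).drop_duplicates().copy()
    if fill_first:
        # The median is computed while the outlier is still present.
        df["minutes"] = df["minutes"].fillna(df["minutes"].median())
        df = df[df["minutes"] <= 180]
    else:
        # Remove the outlier first (keeping missing rows), then fill.
        df = df[df["minutes"].isna() | (df["minutes"] <= 180)].copy()
        df["minutes"] = df["minutes"].fillna(df["minutes"].median())
    return df.groupby("topic")["minutes"].sum()

comparison = pd.DataFrame(
    {
        "fill_then_filter": summarize(fill_first=True),
        "filter_then_fill": summarize(fill_first=False),
    }
)
print(comparison)
```

Only the filled row differs between the two columns, because the outlier shifts the median used for filling. Whether that difference matters to your conclusion is the kind of limitation worth writing into the report.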

Common Failures

| Symptom | First thing to check | Usual fix |
| --- | --- | --- |
| Chart is pretty but conclusion is weak | Did you name the question first? | Write the question above the chart |
| Grouped result looks wrong | Category spaces, aliases, or inconsistent case | Print unique() and normalize categories |
| Missing values change the conclusion | Which rows and columns are missing? | Record the rule: drop, fill, or keep |
| Correlation looks too perfect | Time, scale, leakage, or sampling bias | Compare groups and add limitation notes |
| Notebook cannot rerun | Data path, dependency, or execution order | Restart and run from top to bottom |
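The "grouped result looks wrong" row is the easiest one to reproduce. In this minimal sketch the messy labels are made up for illustration, and stripping whitespace plus lowercasing is only one possible normalization rule. Aliases would need an explicit mapping on top of it.

```python
import pandas as pd

# Hypothetical data: the same topic typed three different ways.
df = pd.DataFrame(
    {
        "topic": ["Python", " python", "PYTHON ", "Pandas"],
        "minutes": [10, 20, 30, 40],
    }
)

before = df["topic"].unique()   # four distinct labels, three of them Python
print(before)

# Normalize: remove surrounding spaces and unify case.
df["topic"] = df["topic"].str.strip().str.lower()

after = df["topic"].unique()    # two distinct labels remain
print(after)

totals = df.groupby("topic")["minutes"].sum()
print(totals)
```

Printing `unique()` before and after also gives you a line for the cleaning log.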

Pass Check

Move to Chapter 4 when you can answer these five questions:

  • What does each column mean, and what unit does it use?
  • Which cleaning rules changed the data?
  • What question does each chart answer?
  • What conclusion is supported, and what is still uncertain?
  • Can another person rerun the analysis from the README?

For a printable checklist, use 3.0 Study Guide and Task Sheet. The next chapter uses this data intuition to understand probability, vectors, gradients, and model evaluation.