3 Data Analysis and Visualization

Main visual for data analysis and visualization

Chapter 3 has one job: help you turn messy data into a trustworthy conclusion with reproducible code and charts.

See The Data Analysis Loop

Main workflow loop of data analysis

Read the picture first. Most useful analysis follows this loop:

readinspectcleansummarizevisualizeexplain

Do not draw charts first. First understand fields, units, missing values, duplicates, and sample sources.

Learning Order And Task List

Use this checklist as both the chapter guide and the task sheet. Each step should make clear where the data came from, how it changed, and what supports the conclusion.

3.1.1 Pure Python Data Processing Follow along: process a small table with lists and dictionaries. Evidence to keep: a note explaining why tables become awkward in pure Python.
3.2.1 NumPy Overview to 3.2.7 Random and Statistics Follow along: practice arrays, shapes, slicing, broadcasting, and vectorized math. Evidence to keep: one NumPy practice file.
3.3.1 Pandas Core Structures to 3.3.8 Time Series Follow along: read a table, clean missing values, group rows, merge tables, and export results. Evidence to keep: cleaned data plus a cleaning log.
3.4.1 Matplotlib to 3.4.4 Visualization Best Practices Follow along: draw charts that answer named questions. Evidence to keep: 3 charts, each with one conclusion.
3.5.1 Relational Databases to 3.5.4 Database Design Follow along: learn enough SQL to filter, group, and join real application data. Evidence to keep: one query or join example.
3.6.1 EDA Project and 3.6.3 Follow-Along Workshop Follow along: build a reproducible data pipeline and report. Evidence to keep: raw data, clean data, chart, report, and README.

Key terms for this chapter:

Term	Meaning
`CSV`	A plain-text table where each row is a record
`DataFrame`	A Pandas table with rows, columns, names, and indexes
`Series`	One column from a DataFrame
`dtype`	The data type of a column or array
`EDA`	Exploratory Data Analysis: first-pass exploration before modeling
`groupby`	Split by category, calculate statistics, then combine
`merge` / `join`	Combine tables by shared keys

First Runnable Loop

Install the two packages once:

python -m pip install pandas matplotlib

Then run this script in an empty practice folder. It creates dirty data, cleans it, summarizes it, and saves one chart.

from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

raw = StringIO("""topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
""")

df = pd.read_csv(raw)
print("Before cleaning")
print(df)

clean_df = df.drop_duplicates()
clean_df["minutes"] = clean_df["minutes"].fillna(clean_df["minutes"].median())
clean_df = clean_df[clean_df["minutes"] <= 180]

summary = clean_df.groupby("topic")["minutes"].sum().sort_values(ascending=False)
print("\nAfter cleaning")
print(summary)

summary.plot(kind="bar", title="Study minutes by topic")
plt.ylabel("minutes")
plt.tight_layout()
plt.savefig("topic_minutes.png")
print("\nSaved chart: topic_minutes.png")

Expected shape:

Before cleaning
...
After cleaning
topic
Python           45.0
Visualization    ...
Saved chart: topic_minutes.png

The pass line is not “the chart looks nice.” The pass line is: you can explain which rows changed, why they changed, and how that affects the conclusion.

How to read this output

Before cleaning shows the raw evidence, including duplicates, missing values, and outliers.
After cleaning shows the transformed table you are actually using for analysis.
topic_minutes.png is the report artifact; keep it with the script that generated it.
If the conclusion changes after another cleaning rule, write that down instead of hiding it.

Depth Ladder

Level	What you can prove
Minimum pass	You can read a table, inspect shape/types/missing values, clean obvious problems, and save one chart.
Project-ready	Your report names the question, cleaning rules, summary table, chart, conclusion, limitation, and rerun command.
Deeper check	You can test whether the conclusion changes under another cleaning rule, spot leakage or sampling bias, and explain why a chart type fits the question.

Evidence to Keep

Keep this page’s proof of learning as a small evidence card:

Data Source: raw records or small dataset used
Processing Step: pure Python, NumPy, Pandas, charting, or SQL operation
Output: cleaned data, statistic, chart, query result, or report note
Failure Check: missing data, shape mismatch, wrong aggregation, or unclear question
Expected Output: data artifact plus the evidence needed to trust it

Common Failures

Symptom	First thing to check	Usual fix
Chart is pretty but conclusion is weak	Did you name the question first?	Write the question above the chart
Grouped result looks wrong	Category spaces, aliases, or inconsistent case	Print `unique()` and normalize categories
Missing values change the conclusion	Which rows and columns are missing?	Record the rule: drop, fill, or keep
Correlation looks too perfect	Time, scale, leakage, or sampling bias	Compare groups and add limitation notes
Notebook cannot rerun	Data path, dependency, or execution order	Restart and run from top to bottom

Pass Check

Move to Chapter 4 when you can answer these five questions:

What does each column mean, and what unit does it use?
Which cleaning rules changed the data?
What question does each chart answer?
What conclusion is supported, and what is still uncertain?
Can another person rerun the analysis from the README?

For a printable checklist, use 3.0 Study Guide and Task Sheet. The next chapter uses this data intuition to understand probability, vectors, gradients, and model evaluation.

Check reasoning and explanation

Use the five pass-check questions as a small data story, not as five separate slogans.
A complete answer names the columns and units, lists every cleaning rule that changed rows or values, ties each chart to one explicit question, separates supported conclusions from uncertainty, and includes a README that lets another person rerun the notebook or script.
If any answer depends on memory instead of a saved table, chart, or command output, the evidence pack is not ready yet.