3 Data Analysis and Visualization

Chapter 3 has one job: help you turn messy data into a trustworthy conclusion with reproducible code and charts.

[Figure: main workflow loop of data analysis]

Read the picture first. Most useful analysis follows this loop:

read → inspect → clean → summarize → visualize → explain

Do not start with charts. First understand the fields, units, missing values, duplicates, and where the sample came from.
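A first inspection pass might look like the sketch below. It reuses the small sample from the chapter's script; the variable names (`missing_per_column`, `n_duplicates`) are illustrative, and a real file would be loaded with `pd.read_csv("your_file.csv")` instead of `StringIO`.

```python
from io import StringIO

import pandas as pd

# Small sample for illustration; swap in pd.read_csv("your_file.csv") for real data.
raw = StringIO("""topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
""")
df = pd.read_csv(raw)

# Shape: how many rows and columns are there?
n_rows, n_cols = df.shape
print(f"rows={n_rows}, columns={n_cols}")

# Types: is each column stored as the kind of value you expect?
print(df.dtypes)

# Missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())
print(f"fully duplicated rows: {n_duplicates}")

# Range check: min and max often reveal unit problems or outliers.
print(df["minutes"].describe())
```

None of these calls changes the data. They only tell you what you would be charting.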

Use this checklist as both the chapter guide and the task sheet. Each step should make clear where the data came from, how it changed, and what supports the conclusion.

  1. 3.1.1 Pure Python Data Processing
     Follow along: process a small table with lists and dictionaries.
     Evidence to keep: a note explaining why tables become awkward in pure Python.

  2. 3.2.1 NumPy Overview to 3.2.7 Random and Statistics
     Follow along: practice arrays, shapes, slicing, broadcasting, and vectorized math.
     Evidence to keep: one NumPy practice file.

  3. 3.3.1 Pandas Core Structures to 3.3.8 Time Series
     Follow along: read a table, clean missing values, group rows, merge tables, and export results.
     Evidence to keep: cleaned data plus a cleaning log.

  4. 3.4.1 Matplotlib to 3.4.4 Visualization Best Practices
     Follow along: draw charts that answer named questions.
     Evidence to keep: 3 charts, each with one conclusion.

  5. 3.5.1 Relational Databases to 3.5.4 Database Design
     Follow along: learn enough SQL to filter, group, and join real application data.
     Evidence to keep: one query or join example.

  6. 3.6.1 EDA Project and 3.6.3 Follow-Along Workshop
     Follow along: build a reproducible data pipeline and report.
     Evidence to keep: raw data, clean data, chart, report, and README.
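To get a feel for step 1 in the checklist above, here is a minimal sketch that totals minutes per topic using only the standard library. The sample data and the names `totals` and `skipped` are illustrative assumptions, not the chapter's own exercise.

```python
import csv
from collections import defaultdict
from io import StringIO

# Illustrative sample; a real exercise would open a CSV file instead.
raw = StringIO("topic,minutes\nPython,45\nPandas,30\nPython,45\nVisualization,\nPandas,300\n")

# Each row becomes a dict, and every value arrives as a string.
rows = list(csv.DictReader(raw))

totals = defaultdict(float)
skipped = []
for row in rows:
    text = row["minutes"].strip()
    if not text:
        # Missing values must be handled by hand.
        skipped.append(row)
        continue
    # Type conversion is also manual.
    totals[row["topic"]] += float(text)

print(dict(totals))
print(f"skipped {len(skipped)} row(s) with missing minutes")
```

Even this small task needs manual type conversion, manual handling of missing values, and a hand-written grouping loop. The duplicate row and the 300-minute outlier are still counted, which is the kind of observation the step 1 note can record.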

Key terms for this chapter:

| Term | Meaning |
| --- | --- |
| CSV | A plain-text table where each row is a record |
| DataFrame | A Pandas table with rows, columns, names, and indexes |
| Series | One column from a DataFrame |
| dtype | The data type of a column or array |
| EDA | Exploratory Data Analysis: first-pass exploration before modeling |
| groupby | Split by category, calculate statistics, then combine |
| merge / join | Combine tables by shared keys |
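The sketch below shows several of these terms in one place. The two tables and their column names (`sessions`, `topics`, `section`) are hypothetical and exist only to illustrate the vocabulary.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
sessions = pd.DataFrame({
    "topic": ["Python", "Pandas", "Python", "SQL"],
    "minutes": [45, 30, 60, 20],
})
topics = pd.DataFrame({
    "topic": ["Python", "Pandas", "SQL"],
    "section": ["3.1", "3.3", "3.5"],
})

# Series: one column pulled out of a DataFrame.
minutes = sessions["minutes"]

# dtype: the data type shared by the values in that column.
print(minutes.dtype)

# groupby: split rows by topic, sum each group, combine into one result.
per_topic = sessions.groupby("topic")["minutes"].sum()
print(per_topic)

# merge: attach the section label to each session using the shared "topic" key.
merged = sessions.merge(topics, on="topic", how="left")
print(merged)
```

`how="left"` keeps every row of `sessions` even when no matching topic exists, which makes unmatched keys visible as missing values rather than silently dropping rows.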

Install the two packages once:

python -m pip install pandas matplotlib

Then run this script in an empty practice folder. It creates dirty data, cleans it, summarizes it, and saves one chart.

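from io import StringIO

import pandas as pd
import matplotlib.pyplot as plt

# Small dirty sample: one duplicate row, one missing value, one outlier (300).
raw = StringIO("""topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
""")
df = pd.read_csv(raw)
print("Before cleaning")
print(df)

# Rule 1: drop exact duplicate rows. .copy() makes an independent frame, which
# avoids a possible SettingWithCopyWarning on the assignment below.
clean_df = df.drop_duplicates().copy()
# Rule 2: fill missing minutes with the median (computed while the outlier is still present).
clean_df["minutes"] = clean_df["minutes"].fillna(clean_df["minutes"].median())
# Rule 3: drop rows above 180 minutes as outliers.
clean_df = clean_df[clean_df["minutes"] <= 180]

# Total minutes per topic, largest first.
summary = clean_df.groupby("topic")["minutes"].sum().sort_values(ascending=False)
print("\nAfter cleaning")
print(summary)

# Save the chart in the current folder so the script can regenerate it.
summary.plot(kind="bar", title="Study minutes by topic")
plt.ylabel("minutes")
plt.tight_layout()
plt.savefig("topic_minutes.png")
print("\nSaved chart: topic_minutes.png")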

Expected shape:

Before cleaning
...
After cleaning
topic
Python 45.0
Visualization ...
Saved chart: topic_minutes.png

The pass line is not “the chart looks nice.” The pass line is: you can explain which rows changed, why they changed, and how that affects the conclusion.

  • Before cleaning shows the raw evidence, including duplicates, missing values, and outliers.
  • After cleaning shows the transformed table you are actually using for analysis.
  • topic_minutes.png is the report artifact; keep it with the script that generated it.
  • If the conclusion changes after another cleaning rule, write that down instead of hiding it.
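One way to make "which rows changed and why" answerable is to record a small cleaning log while the rules run. The sketch below applies the same three rules as the script above. The log structure and the names `step1`, `step2`, `step3`, and `cleaning_log` are assumptions for illustration, not a fixed format.

```python
from io import StringIO

import pandas as pd

raw = StringIO("""topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
""")
df = pd.read_csv(raw)
log = []  # one entry per cleaning rule

# Rule 1: drop exact duplicate rows.
step1 = df.drop_duplicates().copy()
log.append({"rule": "drop exact duplicate rows",
            "rows_before": len(df), "rows_after": len(step1), "values_changed": 0})

# Rule 2: fill missing minutes with the median of the remaining rows.
n_missing = int(step1["minutes"].isna().sum())
fill_value = step1["minutes"].median()
step2 = step1.copy()
step2["minutes"] = step2["minutes"].fillna(fill_value)
log.append({"rule": f"fill missing minutes with median ({fill_value})",
            "rows_before": len(step1), "rows_after": len(step2), "values_changed": n_missing})

# Rule 3: drop rows above 180 minutes as outliers.
step3 = step2[step2["minutes"] <= 180]
log.append({"rule": "drop rows with minutes > 180",
            "rows_before": len(step2), "rows_after": len(step3), "values_changed": 0})

cleaning_log = pd.DataFrame(log)
print(cleaning_log.to_string(index=False))
```

Saving the log next to the chart, for example with `cleaning_log.to_csv(...)`, keeps the explanation alongside the artifact it explains.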

| Level | What you can prove |
| --- | --- |
| Minimum pass | You can read a table, inspect shape/types/missing values, clean obvious problems, and save one chart. |
| Project-ready | Your report names the question, cleaning rules, summary table, chart, conclusion, limitation, and rerun command. |
| Deeper check | You can test whether the conclusion changes under another cleaning rule, spot leakage or sampling bias, and explain why a chart type fits the question. |
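For the deeper check, one possible approach is to run the same summary under two rules for missing values and compare the results side by side. The `summarize` helper and its `missing_rule` parameter are hypothetical names introduced for this sketch.

```python
from io import StringIO

import pandas as pd

RAW = """topic,minutes
Python,45
Pandas,30
Python,45
Visualization,
Pandas,300
"""


def summarize(df, missing_rule):
    """Total minutes per topic under one rule for missing values."""
    clean = df.drop_duplicates().copy()
    if missing_rule == "fill_median":
        clean["minutes"] = clean["minutes"].fillna(clean["minutes"].median())
    elif missing_rule == "drop":
        clean = clean.dropna(subset=["minutes"])
    else:
        raise ValueError(f"unknown missing_rule: {missing_rule}")
    # Same outlier rule in both variants, so only one thing differs.
    clean = clean[clean["minutes"] <= 180]
    return clean.groupby("topic")["minutes"].sum()


df = pd.read_csv(StringIO(RAW))

# Align both summaries on topic; a topic absent under one rule shows up as NaN.
comparison = pd.DataFrame({
    "fill_median": summarize(df, "fill_median"),
    "drop": summarize(df, "drop"),
})
print(comparison)
```

Here the Visualization total exists only because a value was filled in, so any conclusion about that topic rests on the cleaning rule rather than on observed data. That belongs in the limitations note.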

Keep this page’s proof of learning as a small evidence card:

Data Source: raw records or small dataset used
Processing Step: pure Python, NumPy, Pandas, charting, or SQL operation
Output: cleaned data, statistic, chart, query result, or report note
Failure Check: missing data, shape mismatch, wrong aggregation, or unclear question
Expected Output: data artifact plus the evidence needed to trust it

| Symptom | First thing to check | Usual fix |
| --- | --- | --- |
| Chart is pretty but conclusion is weak | Did you name the question first? | Write the question above the chart |
| Grouped result looks wrong | Category spaces, aliases, or inconsistent case | Print unique() and normalize categories |
| Missing values change the conclusion | Which rows and columns are missing? | Record the rule: drop, fill, or keep |
| Correlation looks too perfect | Time, scale, leakage, or sampling bias | Compare groups and add limitation notes |
| Notebook cannot rerun | Data path, dependency, or execution order | Restart and run from top to bottom |
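The "grouped result looks wrong" symptom can be reproduced and fixed in a few lines. The data below is invented for illustration, and `topic_clean` is an assumed column name.

```python
import pandas as pd

# Invented sample with stray spaces and inconsistent case.
df = pd.DataFrame({
    "topic": ["Python", "python ", " Pandas", "pandas", "PYTHON"],
    "minutes": [45, 30, 20, 25, 10],
})

# Inspect the raw categories first; the spaces and case variants become visible.
print(df["topic"].unique())

# Grouping the raw values treats every variant as its own category.
before = df.groupby("topic")["minutes"].sum()
print(before)

# Normalize into a new column so the original values stay available as evidence.
df["topic_clean"] = df["topic"].str.strip().str.lower()
after = df.groupby("topic_clean")["minutes"].sum()
print(after)
```

Stripping and lowercasing handles spaces and case only. Aliases (two different names for the same category) need an explicit mapping that you decide on and record as a cleaning rule.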

Move to Chapter 4 when you can answer these five questions:

  • What does each column mean, and what unit does it use?
  • Which cleaning rules changed the data?
  • What question does each chart answer?
  • What conclusion is supported, and what is still uncertain?
  • Can another person rerun the analysis from the README?

For a printable checklist, use 3.0 Study Guide and Task Sheet. The next chapter uses this data intuition to understand probability, vectors, gradients, and model evaluation.

Check reasoning and explanation
  • Use the five pass-check questions as a small data story, not as five separate slogans.
  • A complete answer names the columns and units, lists every cleaning rule that changed rows or values, ties each chart to one explicit question, separates supported conclusions from uncertainty, and includes a README that lets another person rerun the notebook or script.
  • If any answer depends on memory instead of a saved table, chart, or command output, the evidence pack is not ready yet.