Skip to content

3.3.2 Pandas Core Data Structures

Pandas DataFrame structure diagram

  • Understand the role of Pandas in data analysis
  • Master Series creation and basic operations
  • Master DataFrame creation and basic attributes
  • Understand the Index mechanism

When learning Pandas for the first time, the safest order is not to “memorize all methods right away,” but to first see clearly:

flowchart LR
A["Series: a labeled column of data"] --> B["DataFrame: a table made by combining many columns"]
B --> C["Index: the positioning system for rows and columns"]
C --> D["All later filtering, cleaning, and aggregation build on these"]

So what this section really wants to solve is:

  • Why Pandas is not just “Excel in Python”
  • Why Series / DataFrame / Index become the foundation of the whole chapter

If NumPy is the engine of Python data science, then Pandas is the steering wheel and dashboard—it lets you control and inspect data conveniently.

flowchart LR
A["Raw data<br/>CSV / Excel / JSON / SQL"] --> B["Pandas<br/>read · clean · analyze"]
B --> C["Analysis results<br/>statistics · visualization · reports"]
style B fill:#ff9800,color:#fff,stroke:#e65100

Core capabilities of Pandas:

CapabilityDescription
Data I/ORead CSV, Excel, JSON, SQL in one line
Data cleaningHandle missing values, duplicates, and outliers
Data filteringFilter and query data flexibly like SQL
Grouped statisticsgroupby is dozens of times faster than pure Python loops
Data mergingMerge multiple tables like SQL JOIN

You can think of Pandas as:

  • A smart spreadsheet that remembers row and column labels

A regular list is more like:

  • A pile of raw data with no column names

NumPy is more like:

  • A matrix engine built for high-speed numerical computing

Pandas is more like:

  • The place where you really start working with “fields, row records, and table structures”

Remember the warm-up exercise in Chapter 1? 75 lines of pure Python code did the job, while Pandas did it in just 5 lines. Now let’s formally learn it.

import pandas as pd
import numpy as np
print(pd.__version__) # e.g. 2.2.0

Series: a one-dimensional array with labels

Section titled “Series: a one-dimensional array with labels”

Series is the most basic Pandas data structure—you can think of it as a NumPy array with labels.

When you first see a Series, what should you focus on?

Section titled “When you first see a Series, what should you focus on?”

The most important thing to grasp first is this sentence:

Series = one column of data + one set of labels.

Once this idea is solid, later on when you see:

  • Access by label
  • Access by position
  • Operations on an entire column

it will all feel much smoother.

import pandas as pd
# Create from a list (auto-generated 0, 1, 2... index)
s1 = pd.Series([85, 92, 78, 95, 88])
print(s1)
# 0 85
# 1 92
# 2 78
# 3 95
# 4 88
# dtype: int64
# Specify the index
s2 = pd.Series(
[85, 92, 78, 95, 88],
index=["Chinese", "Math", "English", "Physics", "Chemistry"]
)
print(s2)
# Chinese 85
# Math 92
# English 78
# Physics 95
# Chemistry 88
# dtype: int64
# Create from a dictionary (keys automatically become the index)
scores = {"Chinese": 85, "Math": 92, "English": 78, "Physics": 95}
s3 = pd.Series(scores)
print(s3)
Index Values
────────── ──────────
Chinese 85
Math 92
English 78
Physics 95
Chemistry 88

Each Series consists of two parts:

  • Index: labels used to locate data
  • Values: the actual data, with the underlying storage being a NumPy array

A beginner-friendly comparison table to remember first

Section titled “A beginner-friendly comparison table to remember first”
What you see nowWhat you can think of it as
SeriesA labeled column of data
IndexThe “row name” of this column
ValuesThe actual data itself

This table is great for beginners because it turns abstract terms back into a few more concrete roles.

s = pd.Series([85, 92, 78], index=["Chinese", "Math", "English"])
print(s.index) # Index(['Chinese', 'Math', 'English'], dtype='object')
print(s.values) # [85 92 78] ← This is a NumPy array!
print(s.dtype) # int64
print(s.shape) # (3,)
print(len(s)) # 3
s = pd.Series([85, 92, 78, 95], index=["Chinese", "Math", "English", "Physics"])
# Access by label
print(s["Math"]) # 92
# Access by position
print(s.iloc[1]) # 92
# Slicing
print(s["Chinese":"English"]) # Label slicing (includes the end!)
# Chinese 85
# Math 92
# English 78
# Boolean indexing
print(s[s >= 90])
# Math 92
# Physics 95
s = pd.Series([85, 92, 78, 95], index=["Chinese", "Math", "English", "Physics"])
# Vectorized operations (same idea as NumPy)
print(s + 5) # Add 5 points to each subject
print(s * 1.1) # Multiply each subject by 1.1
print(s.mean()) # 87.5 Average score
print(s.max()) # 95 Highest score
print(s.describe()) # Generate descriptive statistics in one line

DataFrame: a labeled two-dimensional table

Section titled “DataFrame: a labeled two-dimensional table”

DataFrame is the core of Pandas—you can think of it as an Excel table, or as a dictionary of multiple Series.

When you first see a DataFrame, what should you remember first?

Section titled “When you first see a DataFrame, what should you remember first?”

The most important thing to remember is:

DataFrame = multiple Series combined into one table using the same row index.

You can first think of it as:

  • A real data table with column names and row numbers

rather than a bunch of arrays stuck together.

# Method 1: Create from a dictionary (most common)
data = {
"Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
"Age": [22, 25, 23, 28, 21],
"City": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou"],
"Salary": [15000, 22000, 18000, 25000, 16000]
}
df = pd.DataFrame(data)
print(df)
# Name Age City Salary
# 0 Zhang San 22 Beijing 15000
# 1 Li Si 25 Shanghai 22000
# 2 Wang Wu 23 Guangzhou 18000
# 3 Zhao Liu 28 Shenzhen 25000
# 4 Qian Qi 21 Hangzhou 16000
# Method 2: Create from a list of lists
data = [
["Zhang San", 22, "Beijing"],
["Li Si", 25, "Shanghai"],
["Wang Wu", 23, "Guangzhou"]
]
df = pd.DataFrame(data, columns=["Name", "Age", "City"])
# Method 3: Create from a NumPy array
rng = np.random.default_rng(seed=42)
arr = rng.integers(60, 100, size=(5, 3))
df = pd.DataFrame(arr, columns=["Chinese", "Math", "English"])
# Method 4: Create from a dictionary of Series
df = pd.DataFrame({
"Math": pd.Series([90, 85, 78], index=["Zhang San", "Li Si", "Wang Wu"]),
"English": pd.Series([88, 92, 75], index=["Zhang San", "Li Si", "Wang Wu"])
})
PartExample
Index0, 1, 2, 3, 4
ColumnsName, Age, City, Salary
One rowZhang San, 22, Beijing, 15000

DataFrame = row index (Index) + column names (Columns) + data (Values)

data = {
"Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
"Age": [22, 25, 23, 28, 21],
"City": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou"],
"Salary": [15000, 22000, 18000, 25000, 16000]
}
df = pd.DataFrame(data)
print(df.shape) # (5, 4) → 5 rows, 4 columns
print(df.columns) # Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
print(df.index) # RangeIndex(start=0, stop=5, step=1)
print(df.dtypes)
# Name object ← string
# Age int64
# City object
# Salary int64
print(df.size) # 20 → 5 × 4 = 20 elements
print(len(df)) # 5 → number of rows
# First 3 rows
print(df.head(3))
# Last 2 rows
print(df.tail(2))
# Basic information
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 5 entries, 0 to 4
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 Name 5 non-null object
# 1 Age 5 non-null int64
# 2 City 5 non-null object
# 3 Salary 5 non-null int64
# Summary statistics for numeric columns
print(df.describe())
# Age Salary
# count 5.000000 5.000000
# mean 23.800000 19200.000000
# std 2.774887 4147.288271
# min 21.000000 15000.000000
# 25% 22.000000 16000.000000
# 50% 23.000000 18000.000000
# 75% 25.000000 22000.000000
# max 28.000000 25000.000000

A beginner-friendly sequence for a new table

Section titled “A beginner-friendly sequence for a new table”

A safer workflow is usually:

  1. Check df.head()
  2. Check df.info()
  3. Check df.describe()
  4. Then start filtering and cleaning

This is much less likely to make you lose your way than jumping straight into complex operations.

# Access a single column → returns a Series
print(df["Name"])
# 0 Zhang San
# 1 Li Si
# ...
# You can also use dot notation (when the column name has no spaces and does not conflict with methods)
print(df.Age)
# Access multiple columns → returns a DataFrame
print(df[["Name", "Salary"]])
# Name Salary
# 0 Zhang San 15000
# 1 Li Si 22000
# ...

Why is “learning to read columns first” so important?

Section titled “Why is “learning to read columns first” so important?”

Because most Pandas work later on does three things:

  • Select columns
  • Modify columns
  • Perform statistics and combinations based on columns

So when learning Pandas for the first time, instead of rushing to memorize lots of advanced methods, it’s better to first make sure you understand “How do I find this column, what type is it, and what can I do with it?”

# Add a new column
df["After-Tax Salary"] = df["Salary"] * 0.85
print(df[["Name", "Salary", "After-Tax Salary"]])
# Add a column based on a condition
df["Salary Level"] = np.where(df["Salary"] >= 20000, "High", "Medium")
print(df[["Name", "Salary", "Salary Level"]])
# Delete a column
df = df.drop(columns=["After-Tax Salary"]) # returns a new DataFrame
# or
# df.drop(columns=["After-Tax Salary"], inplace=True) # modify in place

Index is the key feature that distinguishes Pandas from NumPy.

df = pd.DataFrame({
"Name": ["Zhang San", "Li Si", "Wang Wu"],
"Age": [22, 25, 23],
"Salary": [15000, 22000, 18000]
})
# Set the "Name" column as the index
df_indexed = df.set_index("Name")
print(df_indexed)
# Age Salary
# Name
# Zhang San 22 15000
# Li Si 25 22000
# Wang Wu 23 18000
# Access by index
print(df_indexed.loc["Li Si"])
# Age 25
# Salary 22000
# Reset the index
df_reset = df_indexed.reset_index()
print(df_reset) # same as before

Pandas operations automatically align by index—this is a very powerful feature:

s1 = pd.Series({"Chinese": 85, "Math": 92, "English": 78})
s2 = pd.Series({"Math": 88, "English": 82, "Physics": 90})
# Automatically align by index when adding
result = s1 + s2
print(result)
# English 160.0
# Math 180.0
# Physics NaN ← s1 does not have Physics, so the result is NaN
# Chinese NaN ← s2 does not have Chinese, so the result is NaN

FeatureSeriesDataFrame
Dimension1D2D
AnalogyOne column in ExcelAn entire Excel table
Creationpd.Series([1,2,3])pd.DataFrame({"a":[1,2]})
Accessing a columndf["column_name"] returns a Series
IndexOne IndexRow index + column index

Keep this page’s proof of learning as a small evidence card:

Dataframe State
columns, dtypes, row count, missing values, and sample rows
Operation
read/write, select/filter, clean, transform, groupby, merge, or time-series step
Output
resulting table, saved file, aggregation, join result, or time index view
Failure Check
dtype mismatch, missing data, duplicated keys, chained assignment, or wrong time frequency
Expected Output
before/after table sample with the transformation reason
root((Pandas Core Data Structures))
Series
Labeled one-dimensional array
index + values
Like an enhanced dictionary
DataFrame
Labeled two-dimensional table
index + columns + values
Like an Excel table
Index
Automatic alignment
set_index / reset_index
Label-based access
Common methods
head / tail
info / describe
shape / dtypes / columns

# Create a Series representing daily step counts for one week
# Use "Monday" to "Sunday" as the index
# 1. Print the average step count
# 2. Find the day with the most steps
# 3. Find the days with more than 8000 steps
# Create a feature progress DataFrame containing:
# Feature, Owner, Planned Hours, Actual Hours, Status columns, at least 5 items
# 1. Add a "Delta Hours" column
# 2. Add an "Over Budget" column
# 3. Add a "Risk" column (Blocked -> High, Review -> Medium, Done -> Low, otherwise Watch)
# 4. Use describe() to view statistics for the numeric columns
# Use the DataFrame from Exercise 2
# 1. Set "Feature" as the index
# 2. Find one feature's full progress record by name
# 3. Reset the index
Reference implementation and walkthrough
  • For the weekly step-count Series, use weekday names as the index, then compute the mean, max day with idxmax, and filtered high-step days with a boolean condition.
  • For the feature progress DataFrame, add Delta Hours and Over Budget, then create Risk with a function, map, or pd.cut depending on your rule. describe() is useful evidence, but it is not the full analysis.
  • Index practice should show both set_index and reset_index. A good answer explains when label lookup with .loc is clearer than positional lookup with .iloc.