3.3.2 Pandas Core Data Structures

Learning objectives
- Understand the role of Pandas in data analysis
- Master Series creation and basic operations
- Master DataFrame creation and basic attributes
- Understand the Index mechanism
First build a map
When learning Pandas for the first time, the safest order is not to “memorize all methods right away,” but to first see clearly:

```mermaid
flowchart LR
    A["Series: a labeled column of data"] --> B["DataFrame: a table made by combining many columns"]
    B --> C["Index: the positioning system for rows and columns"]
    C --> D["All later filtering, cleaning, and aggregation build on these"]
```

So the questions this section really sets out to answer are:
- Why `Pandas` is not just “Excel in Python”
- Why `Series / DataFrame / Index` become the foundation of the whole chapter
What is Pandas?
If NumPy is the engine of Python data science, then Pandas is the steering wheel and dashboard: it lets you control and inspect data conveniently.

```mermaid
flowchart LR
    A["Raw data<br/>CSV / Excel / JSON / SQL"] --> B["Pandas<br/>read · clean · analyze"]
    B --> C["Analysis results<br/>statistics · visualization · reports"]
    style B fill:#ff9800,color:#fff,stroke:#e65100
```

Core capabilities of Pandas:
| Capability | Description |
|---|---|
| Data I/O | Read CSV, Excel, JSON, SQL in one line |
| Data cleaning | Handle missing values, duplicates, and outliers |
| Data filtering | Filter and query data flexibly like SQL |
| Grouped statistics | `groupby` runs vectorized aggregations, usually much faster than equivalent pure Python loops |
| Data merging | Merge multiple tables like SQL JOIN |
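As a rough illustration of the capabilities in the table, the sketch below runs each one once on a tiny made-up dataset. The CSV text, the column names (`name`, `dept`, `salary`), and the `offices` table are all invented for this example; a real project would read from a file path instead of a string.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = """name,dept,salary
Ann,Sales,15000
Bob,Sales,
Cid,Tech,22000
Dee,Tech,18000
"""

# Data I/O: read CSV text into a DataFrame
staff = pd.read_csv(io.StringIO(csv_text))

# Data cleaning: drop rows whose salary is missing
clean = staff.dropna(subset=["salary"])

# Data filtering: keep rows that satisfy a condition
high = clean[clean["salary"] >= 18000]

# Grouped statistics: mean salary per department
mean_by_dept = clean.groupby("dept")["salary"].mean()

# Data merging: attach an office to each row, like a SQL LEFT JOIN
offices = pd.DataFrame({"dept": ["Sales", "Tech"], "office": ["Floor 1", "Floor 2"]})
merged = clean.merge(offices, on="dept", how="left")

print(high)
print(mean_by_dept)
print(merged)
```

Each step returns a new object, so the intermediate tables (`clean`, `high`, `merged`) can be inspected one at a time.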
A better beginner-friendly analogy
You can think of Pandas as:
- A smart spreadsheet that remembers row and column labels
A regular list is more like:
- A pile of raw data with no column names
NumPy is more like:
- A matrix engine built for high-speed numerical computing
Pandas is more like:
- The place where you really start working with “fields, row records, and table structures”
Remember the warm-up exercise in Chapter 1? 75 lines of pure Python code did the job, while Pandas did it in just 5 lines. Now let’s formally learn it.
```python
import pandas as pd
import numpy as np

print(pd.__version__)  # e.g. 2.2.0
```

Series: a one-dimensional array with labels
Series is the most basic Pandas data structure. You can think of it as a NumPy array with labels.
When you first see a Series, what should you focus on?
The most important thing to grasp first is this sentence:
Series = one column of data + one set of labels.
Once this idea is solid, later on when you see:
- Access by label
- Access by position
- Operations on an entire column
it will all feel much smoother.
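One place where "data plus labels" matters right away is when the labels are themselves integers. The sketch below uses a made-up Series whose labels are deliberately reversed, so that access by label and access by position give different answers.

```python
import pandas as pd

# Integer labels deliberately chosen so that label != position
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the first row, whatever its label
whole_column = s * 2     # an operation on the entire column keeps the labels

print(by_label)     # 30
print(by_position)  # 10
print(whole_column)
```

Using `.loc` for labels and `.iloc` for positions keeps the intent explicit, which is why the example avoids plain `s[0]`.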
Creating a Series
```python
import pandas as pd

# Create from a list (auto-generated 0, 1, 2... index)
s1 = pd.Series([85, 92, 78, 95, 88])
print(s1)
# 0    85
# 1    92
# 2    78
# 3    95
# 4    88
# dtype: int64

# Specify the index
s2 = pd.Series(
    [85, 92, 78, 95, 88],
    index=["Chinese", "Math", "English", "Physics", "Chemistry"]
)
print(s2)
# Chinese      85
# Math         92
# English      78
# Physics      95
# Chemistry    88
# dtype: int64

# Create from a dictionary (keys automatically become the index)
scores = {"Chinese": 85, "Math": 92, "English": 78, "Physics": 95}
s3 = pd.Series(scores)
print(s3)
```

Structure of a Series

```text
Index       Values
──────────  ──────────
Chinese     85
Math        92
English     78
Physics     95
Chemistry   88
```

Each Series consists of two parts:
- Index: labels used to locate data
- Values: the actual data, with the underlying storage being a NumPy array
A beginner-friendly comparison table to remember first
| What you see now | What you can think of it as |
|---|---|
| `Series` | A labeled column of data |
| `Index` | The “row name” of this column |
| `Values` | The actual data itself |
This table is great for beginners because it turns abstract terms back into a few more concrete roles.
```python
s = pd.Series([85, 92, 78], index=["Chinese", "Math", "English"])

print(s.index)   # Index(['Chinese', 'Math', 'English'], dtype='object')
print(s.values)  # [85 92 78]  ← This is a NumPy array!
print(s.dtype)   # int64
print(s.shape)   # (3,)
print(len(s))    # 3
```

Accessing a Series
```python
s = pd.Series([85, 92, 78, 95], index=["Chinese", "Math", "English", "Physics"])

# Access by label
print(s["Math"])  # 92

# Access by position
print(s.iloc[1])  # 92

# Slicing
print(s["Chinese":"English"])  # Label slicing (includes the end!)
# Chinese    85
# Math       92
# English    78

# Boolean indexing
print(s[s >= 90])
# Math       92
# Physics    95
```

Operations on a Series
```python
s = pd.Series([85, 92, 78, 95], index=["Chinese", "Math", "English", "Physics"])

# Vectorized operations (same idea as NumPy)
print(s + 5)         # Add 5 points to each subject
print(s * 1.1)       # Multiply each subject by 1.1
print(s.mean())      # 87.5  Average score
print(s.max())       # 95    Highest score
print(s.describe())  # Generate descriptive statistics in one line
```

DataFrame: a labeled two-dimensional table
DataFrame is the core of Pandas. You can think of it as an Excel table, or as a dictionary of multiple Series.
When you first see a DataFrame, what should you remember first?
The most important thing to remember is:
DataFrame = multiple Series combined into one table using the same row index.
You can first think of it as:
- A real data table with column names and row numbers
rather than a bunch of arrays stuck together.
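To make "multiple Series sharing one row index" concrete, here is a minimal sketch in which the two columns only partly share their labels. The names and scores are invented for illustration.

```python
import pandas as pd

# Two columns whose labels only partly overlap (names are made up)
math = pd.Series({"Ann": 90, "Bob": 85, "Cid": 78})
english = pd.Series({"Bob": 92, "Cid": 75, "Dee": 88})

# The DataFrame's row index is the union of both label sets;
# a label missing from one column shows up as NaN in that column.
df = pd.DataFrame({"Math": math, "English": english})
print(df)

# Each column is still a Series, and it carries the shared row index
print(type(df["Math"]).__name__)
print(df["Math"].index.equals(df.index))
```

Arrays that were merely stuck together would have to be the same length; here the labels decide which values end up in the same row.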
Creating a DataFrame
```python
# Method 1: Create from a dictionary (most common)
data = {
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
    "Age": [22, 25, 23, 28, 21],
    "City": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou"],
    "Salary": [15000, 22000, 18000, 25000, 16000]
}
df = pd.DataFrame(data)
print(df)
#         Name  Age       City  Salary
# 0  Zhang San   22    Beijing   15000
# 1      Li Si   25   Shanghai   22000
# 2    Wang Wu   23  Guangzhou   18000
# 3   Zhao Liu   28   Shenzhen   25000
# 4    Qian Qi   21   Hangzhou   16000

# Method 2: Create from a list of lists
data = [
    ["Zhang San", 22, "Beijing"],
    ["Li Si", 25, "Shanghai"],
    ["Wang Wu", 23, "Guangzhou"]
]
df = pd.DataFrame(data, columns=["Name", "Age", "City"])

# Method 3: Create from a NumPy array
rng = np.random.default_rng(seed=42)
arr = rng.integers(60, 100, size=(5, 3))
df = pd.DataFrame(arr, columns=["Chinese", "Math", "English"])

# Method 4: Create from a dictionary of Series
df = pd.DataFrame({
    "Math": pd.Series([90, 85, 78], index=["Zhang San", "Li Si", "Wang Wu"]),
    "English": pd.Series([88, 92, 75], index=["Zhang San", "Li Si", "Wang Wu"])
})
```

Structure of a DataFrame
| Part | Example |
|---|---|
| Index | 0, 1, 2, 3, 4 |
| Columns | Name, Age, City, Salary |
| One row | Zhang San, 22, Beijing, 15000 |
DataFrame = row index (Index) + column names (Columns) + data (Values)
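The three parts can be pulled out of any DataFrame directly. The following is a small sketch on a two-row table; the variable names (`row_labels`, `column_names`, `values`, `first_row`) are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Zhang San", "Li Si"], "Age": [22, 25], "City": ["Beijing", "Shanghai"]}
)

row_labels = list(df.index)      # row index
column_names = list(df.columns)  # column names
values = df.to_numpy()           # the data as a 2-D NumPy array

print(row_labels)
print(column_names)
print(values.shape)

# One row comes back as a Series whose index is the column names
first_row = df.iloc[0]
print(list(first_row.index))
print(first_row["City"])
```

Seeing a row come back as a Series labeled by column names is a useful reminder that columns, not rows, are the primary unit in a DataFrame.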
Basic attributes
```python
data = {
    "Name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
    "Age": [22, 25, 23, 28, 21],
    "City": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou"],
    "Salary": [15000, 22000, 18000, 25000, 16000]
}
df = pd.DataFrame(data)

print(df.shape)    # (5, 4) → 5 rows, 4 columns
print(df.columns)  # Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
print(df.index)    # RangeIndex(start=0, stop=5, step=1)
print(df.dtypes)
# Name      object  ← string
# Age        int64
# City      object
# Salary     int64
print(df.size)     # 20 → 5 × 4 = 20 elements
print(len(df))     # 5 → number of rows
```

Quickly inspect data
```python
# First 3 rows
print(df.head(3))

# Last 2 rows
print(df.tail(2))

# Basic information (info() prints its report itself and returns None,
# so it is called without print())
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 5 entries, 0 to 4
# Data columns (total 4 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   Name    5 non-null      object
#  1   Age     5 non-null      int64
#  2   City    5 non-null      object
#  3   Salary  5 non-null      int64

# Summary statistics for numeric columns
print(df.describe())
#              Age        Salary
# count   5.000000      5.000000
# mean   23.800000  19200.000000
# std     2.774887   4207.136794
# min    21.000000  15000.000000
# 25%    22.000000  16000.000000
# 50%    23.000000  18000.000000
# 75%    25.000000  22000.000000
# max    28.000000  25000.000000
```

A beginner-friendly sequence for a new table
A safer workflow is usually:
- Check `df.head()`
- Check `df.info()`
- Check `df.describe()`
- Then start filtering and cleaning
This is much less likely to make you lose your way than jumping straight into complex operations.
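The sequence above can be wrapped into a small helper so that it becomes a habit. This is only a sketch: `first_look` is a hypothetical function name, not part of Pandas, and the `people` table is made up.

```python
import pandas as pd


def first_look(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Print a quick overview and return the describe() table."""
    print(df.head(n))        # 1. sample rows
    df.info()                # 2. dtypes and non-null counts (prints directly)
    summary = df.describe()  # 3. statistics for numeric columns
    print(summary)
    return summary


people = pd.DataFrame(
    {
        "Name": ["Zhang San", "Li Si", "Wang Wu"],
        "Age": [22, 25, 23],
        "Salary": [15000, 22000, 18000],
    }
)
summary = first_look(people)

# 4. Only now start filtering
older = people[people["Age"] > 22]
print(older)
```

Returning the `describe()` table lets later code reuse the statistics instead of recomputing them.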
Accessing columns
```python
# Access a single column → returns a Series
print(df["Name"])
# 0    Zhang San
# 1        Li Si
# ...

# You can also use dot notation (when the column name has no spaces and does not conflict with methods)
print(df.Age)

# Access multiple columns → returns a DataFrame
print(df[["Name", "Salary"]])
#         Name  Salary
# 0  Zhang San   15000
# 1      Li Si   22000
# ...
```

Why is “learning to read columns first” so important?
Because most Pandas work later on does three things:
- Select columns
- Modify columns
- Perform statistics and combinations based on columns
So when learning Pandas for the first time, instead of rushing to memorize lots of advanced methods, it’s better to first make sure you understand “How do I find this column, what type is it, and what can I do with it?”
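The three questions ("how do I find this column, what type is it, and what can I do with it?") map onto a few lines of code. The sketch below uses a made-up three-row table; the derived column name `Salary (k)` is an assumption for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Zhang San", "Li Si", "Wang Wu"],
        "City": ["Beijing", "Shanghai", "Beijing"],
        "Salary": [15000, 22000, 18000],
    }
)

# Select: a single column is a Series
salary = df["Salary"]

# What type is it?
print(salary.dtype)

# Modify: derive a new column from an existing one
df["Salary (k)"] = salary / 1000

# Statistics and combinations based on columns
mean_salary = salary.mean()
mean_by_city = df.groupby("City")["Salary"].mean()
print(mean_salary)
print(mean_by_city)
```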
Adding and deleting columns
```python
# Add a new column
df["After-Tax Salary"] = df["Salary"] * 0.85
print(df[["Name", "Salary", "After-Tax Salary"]])

# Add a column based on a condition
df["Salary Level"] = np.where(df["Salary"] >= 20000, "High", "Medium")
print(df[["Name", "Salary", "Salary Level"]])

# Delete a column
df = df.drop(columns=["After-Tax Salary"])  # returns a new DataFrame
# or
# df.drop(columns=["After-Tax Salary"], inplace=True)  # modify in place
```

Why Index matters
Index is the key feature that distinguishes Pandas from NumPy.
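A short contrast may help here. In the sketch below, two made-up Series hold the same subjects in a different order; adding the raw NumPy arrays pairs values by position, while adding the Series pairs them by label.

```python
import numpy as np
import pandas as pd

# Same subjects, but the second term lists them in a different order
term1 = pd.Series({"Chinese": 85, "Math": 92, "English": 78})
term2 = pd.Series({"English": 82, "Chinese": 80, "Math": 88})

# NumPy adds by position: Chinese + English, Math + Chinese, English + Math
by_position = term1.to_numpy() + term2.to_numpy()

# Pandas adds by label: Chinese + Chinese, Math + Math, English + English
by_label = term1 + term2

print(by_position)
print(by_label)
```

The positional sums are arithmetically valid but meaningless, because each one mixes two different subjects; the labeled sums are the ones a reader would expect.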
Setting the index
```python
df = pd.DataFrame({
    "Name": ["Zhang San", "Li Si", "Wang Wu"],
    "Age": [22, 25, 23],
    "Salary": [15000, 22000, 18000]
})

# Set the "Name" column as the index
df_indexed = df.set_index("Name")
print(df_indexed)
#            Age  Salary
# Name
# Zhang San   22   15000
# Li Si       25   22000
# Wang Wu     23   18000

# Access by index
print(df_indexed.loc["Li Si"])
# Age          25
# Salary    22000

# Reset the index
df_reset = df_indexed.reset_index()
print(df_reset)  # same as before
```

Index alignment
Pandas operations automatically align by index, which is a very powerful feature:
```python
s1 = pd.Series({"Chinese": 85, "Math": 92, "English": 78})
s2 = pd.Series({"Math": 88, "English": 82, "Physics": 90})

# Automatically align by index when adding
result = s1 + s2
print(result)
# Chinese      NaN  ← s2 does not have Chinese, so the result is NaN
# English    160.0
# Math       180.0
# Physics      NaN  ← s1 does not have Physics, so the result is NaN
```

Series vs DataFrame comparison
| Feature | Series | DataFrame |
|---|---|---|
| Dimension | 1D | 2D |
| Analogy | One column in Excel | An entire Excel table |
| Creation | pd.Series([1,2,3]) | pd.DataFrame({"a":[1,2]}) |
| Accessing a column | — | df["column_name"] returns a Series |
| Index | One Index | Row index + column index |
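The "Accessing a column" row is easy to verify. The minimal sketch below, on a made-up two-column table, shows that a single label returns a 1D Series while a list of labels returns a 2D DataFrame, even when the list holds only one name.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

one_column = df["a"]   # single label   -> Series (1D)
sub_table = df[["a"]]  # list of labels -> DataFrame (2D), even with one column

print(type(one_column).__name__, one_column.ndim, one_column.shape)
print(type(sub_table).__name__, sub_table.ndim, sub_table.shape)
```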
Evidence to Keep
Keep this page’s proof of learning as a small evidence card:
- DataFrame State
- columns, dtypes, row count, missing values, and sample rows
- Operation
- read/write, select/filter, clean, transform, groupby, merge, or time-series step
- Output
- resulting table, saved file, aggregation, join result, or time index view
- Failure Check
- dtype mismatch, missing data, duplicated keys, chained assignment, or wrong time frequency
- Expected Output
- before/after table sample with the transformation reason
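One possible way to capture the "DataFrame State" part of such a card is sketched below. `dataframe_state` is a hypothetical helper written for this page, not a Pandas function, and the sample data is invented.

```python
import pandas as pd


def dataframe_state(df: pd.DataFrame, n: int = 2) -> dict:
    """Collect columns, dtypes, row count, missing values, and sample rows."""
    return {
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "row_count": len(df),
        "missing_values": df.isna().sum().to_dict(),
        "sample_rows": df.head(n).to_dict(orient="records"),
    }


before = pd.DataFrame({"Name": ["Zhang San", "Li Si", None], "Age": [22, None, 23]})

# Operation: clean (drop every row that has a missing value)
after = before.dropna()

state_before = dataframe_state(before)
state_after = dataframe_state(after)
print(state_before)
print(state_after)
```

Recording the state both before and after the operation gives the "Expected Output" entry its before/after sample for free.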
Summary
```mermaid
mindmap
  root((Pandas Core Data Structures))
    Series
      Labeled one-dimensional array
      index + values
      Like an enhanced dictionary
    DataFrame
      Labeled two-dimensional table
      index + columns + values
      Like an Excel table
    Index
      Automatic alignment
      set_index / reset_index
      Label-based access
    Common methods
      head / tail
      info / describe
      shape / dtypes / columns
```

Hands-on exercises
Exercise 1: Create a Series

```python
# Create a Series representing daily step counts for one week
# Use "Monday" to "Sunday" as the index
# 1. Print the average step count
# 2. Find the day with the most steps
# 3. Find the days with more than 8000 steps
```

Exercise 2: Create a DataFrame

```python
# Create a feature progress DataFrame containing:
# Feature, Owner, Planned Hours, Actual Hours, Status columns, at least 5 items
# 1. Add a "Delta Hours" column
# 2. Add an "Over Budget" column
# 3. Add a "Risk" column (Blocked -> High, Review -> Medium, Done -> Low, otherwise Watch)
# 4. Use describe() to view statistics for the numeric columns
```

Exercise 3: Index operations

```python
# Use the DataFrame from Exercise 2
# 1. Set "Feature" as the index
# 2. Find one feature's full progress record by name
# 3. Reset the index
```

Reference implementation and walkthrough
- For the weekly step-count Series, use weekday names as the index, then compute the mean, max day with `idxmax`, and filtered high-step days with a boolean condition.
- For the feature progress DataFrame, add `Delta Hours` and `Over Budget`, then create `Risk` with a function, `map`, or `pd.cut` depending on your rule. `describe()` is useful evidence, but it is not the full analysis.
- Index practice should show both `set_index` and `reset_index`. A good answer explains when label lookup with `.loc` is clearer than positional lookup with `.iloc`.
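One way the walkthrough above could look in code is sketched here. All numbers, feature names, and owners are invented, and two choices are assumptions rather than requirements of the exercises: `Delta Hours` is taken as actual minus planned hours, and `Risk` is built with `map` plus `fillna` for the "otherwise Watch" case.

```python
import pandas as pd

# Exercise 1: weekly step counts (numbers are invented)
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
steps = pd.Series([6500, 8200, 7400, 9100, 5600, 12000, 8800], index=days)
print(steps.mean())         # average step count
print(steps.idxmax())       # label of the day with the most steps
print(steps[steps > 8000])  # days with more than 8000 steps

# Exercise 2: feature progress table (rows are invented)
progress = pd.DataFrame(
    {
        "Feature": ["Login", "Search", "Export", "Billing", "Alerts"],
        "Owner": ["Ann", "Bob", "Cid", "Dee", "Eve"],
        "Planned Hours": [10, 16, 8, 20, 12],
        "Actual Hours": [12, 15, 8, 26, 5],
        "Status": ["Done", "Review", "Done", "Blocked", "In Progress"],
    }
)
# Assumption: delta = actual - planned, so a positive value means over budget
progress["Delta Hours"] = progress["Actual Hours"] - progress["Planned Hours"]
progress["Over Budget"] = progress["Delta Hours"] > 0
# Statuses missing from the mapping become NaN, then fall back to "Watch"
risk_by_status = {"Blocked": "High", "Review": "Medium", "Done": "Low"}
progress["Risk"] = progress["Status"].map(risk_by_status).fillna("Watch")
print(progress.describe())

# Exercise 3: index operations
by_feature = progress.set_index("Feature")
print(by_feature.loc["Billing"])     # full record, looked up by name
restored = by_feature.reset_index()  # "Feature" becomes a column again
print(restored.head())
```

Looking up `"Billing"` with `.loc` reads clearly because the feature name is the natural key; `.iloc[3]` would return the same row but only for as long as the row order stays the same.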