Skip to content

3.3.6 Data Transformation

  • Understand how apply, map, and applymap work and how they differ
  • Learn sorting (sort_values) and ranking (rank)
  • Master data replacement and mapping

Data transformation is easier to understand by asking: “What do I want this column to become?”

Pandas data transformation pipeline

So what this section really aims to solve is:

  • What each transformation action is used for
  • When to think of map first, and when to think of apply first

apply: Apply a function to rows or columns

Section titled “apply: Apply a function to rows or columns”

Pandas transform method choice diagram

apply is one of Pandas’ most flexible transformation tools — it can apply any function to each row or each column.

You can think of data transformation as:

  • “translating, processing, and relabeling” raw data

Sometimes you just want to:

  • translate codes into English

Sometimes you want to:

  • calculate a new result from several columns in one row

Sometimes you want to:

  • split continuous numbers into three levels: high, medium, and low

All of these look like “transformations,” but they are actually different types of problems.

import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"],
"math": [85, 92, 78, 95],
"english": [90, 88, 72, 85]
})
# Apply a built-in function to a single column
print(df["math"].apply(np.sqrt)) # square root of each score
# Apply a custom function to a single column
def grade(score):
if score >= 90: return "excellent"
elif score >= 80: return "good"
elif score >= 70: return "average"
else: return "pass"
df["math_grade"] = df["math"].apply(grade)
print(df)
# Use lambda for a more concise expression
df["english_grade"] = df["english"].apply(lambda x: "pass" if x >= 60 else "fail")
# axis=1 means operate on each row
df["total"] = df[["math", "english"]].apply(np.sum, axis=1)
# Custom row operation
def student_info(row):
return f"{row['name']}'s math score is {row['math']}"
df["description"] = df.apply(student_info, axis=1)
print(df[["name", "description"]])

When learning apply for the first time, what should you remember first?

Section titled “When learning apply for the first time, what should you remember first?”

The most important thing to remember is:

apply is best for custom calculations when built-in methods are not enough.

In other words, it is not the first tool you should reach for, but more like something you use when:

  • the rule is a bit complex and can’t be handled directly by one or two built-in methods

map is used on a Series to map old values to new values:

df = pd.DataFrame({
"name": ["Zhang San", "Li Si", "Wang Wu"],
"gender": ["M", "F", "M"],
"department_code": [1, 2, 1]
})
# Map with a dictionary
df["gender_cn"] = df["gender"].map({"M": "male", "F": "female"})
# Department code mapping
dept_map = {1: "Engineering", 2: "Marketing", 3: "Management"}
df["department_name"] = df["department_code"].map(dept_map)
# Map with a function
df["name_length"] = df["name"].map(len)
print(df)

When your brain is thinking about:

  • code A -> name A
  • M / F -> male / female
  • month abbreviation -> month name

In this kind of “one value maps to one value” translation relationship, you usually should think of:

  • map
Featuremapapply
Target objectSeries onlySeries or DataFrame
Supports dictionary mapping
Supports functions
Row-wise operation✅ (axis=1)

df = pd.DataFrame({
"city": ["BJ", "SH", "GZ", "SZ", "BJ"],
"level": ["A", "B", "C", "A", "B"]
})
# Replace a single value
df["city"] = df["city"].replace("BJ", "Beijing")
# Replace multiple values (dictionary)
city_map = {"SH": "Shanghai", "GZ": "Guangzhou", "SZ": "Shenzhen"}
df["city"] = df["city"].replace(city_map)
print(df)

Where do map and replace get mixed up most easily?

Section titled “Where do map and replace get mixed up most easily?”

A simple way to remember it is:

  • map is more like “mapping and translating”
  • replace is more like “directly swapping out old values”

If your goal is:

  • converting a whole set of codes into names

that is usually more like map; if you just want to:

  • replace a dirty value

that is usually more like replace.


df = pd.DataFrame({
"name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
"age": [22, 28, 25, 35, 21],
"salary": [15000, 22000, 18000, 35000, 12000]
})
# Sort by salary ascending
print(df.sort_values("salary"))
# Sort by salary descending
print(df.sort_values("salary", ascending=False))
# Multi-column sort: first by age ascending, and if ages are the same, by salary descending
print(df.sort_values(["age", "salary"], ascending=[True, False]))
# Get top 3 (recommended: nlargest)
print(df.nlargest(3, "salary"))
# Get bottom 3
print(df.nsmallest(3, "salary"))
df_indexed = df.set_index("name")
print(df_indexed.sort_index()) # sort by name
print(df_indexed.sort_index(ascending=False))

df = pd.DataFrame({
"name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Qian Qi"],
"score": [85, 92, 78, 92, 88]
})
# Default ranking (equal values get the average rank)
df["rank"] = df["score"].rank(ascending=False)
print(df)
# name score rank
# 0 Zhang San 85 4.0
# 1 Li Si 92 1.5 ← tied for 1st, average of (1+2)
# 2 Wang Wu 78 5.0
# 3 Zhao Liu 92 1.5
# 4 Qian Qi 88 3.0
# Different ranking strategies
df["min_rank"] = df["score"].rank(ascending=False, method="min") # tied values take the smallest rank
df["max_rank"] = df["score"].rank(ascending=False, method="max") # tied values take the largest rank
df["dense_rank"] = df["score"].rank(ascending=False, method="dense") # no gaps in ranking
print(df[["name", "score", "rank", "min_rank", "dense_rank"]])
methodTie handlingExample (92, 92)
averageTake the average1.5, 1.5
minTake the minimum1, 1
maxTake the maximum2, 2
denseDense ranking (no gaps)1, 1 (next is 2)
firstBy order of appearance1, 2

A very practical choice table for beginners

Section titled “A very practical choice table for beginners”
What do you want to do nowBetter first choice
Translate codes into English labelsmap
Calculate a new result from several columns in one rowapply(axis=1)
Find Top N / sortsort_values / nlargest
Rank valuesrank
Split continuous values into intervalscut / qcut

This table is especially useful for beginners, because it turns “there are many transformation methods” back into a few very common problems.


df = pd.DataFrame({
"department": ["Engineering", "Marketing", "Engineering", "Management", "Engineering", "Marketing"]
})
# Count how many times each value appears
print(df["department"].value_counts())
# Engineering 3
# Marketing 2
# Management 1
# Proportion
print(df["department"].value_counts(normalize=True))
print(df["department"].unique()) # ['Engineering' 'Marketing' 'Management']
print(df["department"].nunique()) # 3 (number of unique values)
ages = pd.Series([18, 22, 25, 30, 35, 42, 55, 68])
# Bin by fixed intervals
bins = [0, 18, 30, 50, 100]
labels = ["teen", "young adult", "middle-aged", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group)
# Bin by quantiles (each group has roughly the same number of people)
quartile_group = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartile_group)

Keep this page’s proof of learning as a small evidence card:

Dataframe State
columns, dtypes, row count, missing values, and sample rows
Operation
read/write, select/filter, clean, transform, groupby, merge, or time-series step
Output
resulting table, saved file, aggregation, join result, or time index view
Failure Check
dtype mismatch, missing data, duplicated keys, chained assignment, or wrong time frequency
Expected Output
before/after table sample with the transformation reason
OperationMethodCommon use
Custom transformationapply()Complex row-wise/column-wise calculations
Value mappingmap()Dictionary mapping, code conversion
Value replacementreplace()Fixing incorrect values
Sortingsort_values()Top N, ranking lists
Rankingrank()Score ranking
Counting valuesvalue_counts()Category statistics
Binningcut() / qcut()Age bands, income bands

What should you take away from this section?

Section titled “What should you take away from this section?”
  • The most important thing in data transformation is not the function name, but first figuring out what you want the data to become
  • map is more like mapping and translation, while apply is more like custom processing
  • Sorting, ranking, and binning are all essentially ways of reorganizing how data is expressed

# Create data that contains English month abbreviations
# 1. Map month abbreviations to month names
# 2. Map months to quarters (Q1, Q2, Q3, Q4)
# Create a DataFrame with scores for 3 subjects for 20 students
# 1. Calculate the total score
# 2. Rank by total score (dense ranking)
# 3. Sort by total score and take the top 5
# 4. Label each subject score with a grade (excellent/good/average/pass/fail)
# You have spending data for 100 users
# 1. Use cut to divide spending into three levels: "low spending / medium spending / high spending"
# 2. Use qcut to split them evenly into 5 groups
# 3. Count the number of users and the average spending in each group
Reference implementation and walkthrough
  • Month or category recoding should use an explicit dictionary and then verify unmapped values with isna() or value_counts. Silent unmapped categories are a common source of wrong summaries.
  • For derived sales fields, compute totals, ranks, and top items with vectorized operations such as multiplication, rank, sort_values, and nlargest.
  • For bins, choose cut when the business thresholds are fixed and qcut when you want roughly equal-sized groups. Always print the group counts before interpreting the labels.