3.2.7 Random Numbers and Statistics

NumPy Random Numbers and Statistics Map

Learning Objectives

Master the common functions in the numpy.random module
Understand common probability distributions (uniform, normal, binomial)
Understand the role of the random seed
Learn to use NumPy for basic statistical operations

Why Do We Need Random Numbers?

In data science and AI, random numbers are everywhere:

Scenario	Why random numbers are needed
Dataset splitting	Randomly split the training set and test set
Model initialization	Neural network weights need random initialization
Data augmentation	Randomly crop, rotate, and flip images
Monte Carlo simulation	Use random sampling to estimate complex problems
A/B testing	Randomly assign users to the control group and the experiment group

numpy.random Basics

New API (Recommended)

NumPy recommends using the newer Generator API:

import numpy as np

# Create a random number generator
rng = np.random.default_rng(seed=42)

# Uniform random numbers in [0, 1)
print(rng.random(5))
# [0.773... 0.438... 0.858... 0.697... 0.094...]

# Random integers in a specified range
print(rng.integers(1, 100, size=5))
# [67 82 42 91 23] (example values)

# Normal random numbers
print(rng.standard_normal(5))
# [-0.15... 0.74... -0.27... ...]

Old API (Still Commonly Used)

You will still see the old API in many tutorials and code examples, so you need to recognize it too:

# Old-style usage (still valid)
np.random.seed(42)  # Set the global seed

# Uniform random numbers in [0, 1)
print(np.random.rand(3))

# Standard normal distribution
print(np.random.randn(3))

# Random integers
print(np.random.randint(1, 100, size=5))

New vs. Old

New default_rng(): more flexible, supports independent random states, and is recommended for new code
Old np.random.xxx(): uses global state, simple and direct, and very common in older code

You should recognize both styles. This tutorial covers both.

Random Seed: Making "Random" Reproducible

In scientific research and debugging, we often need "reproducible randomness" — getting the same results every time we run the code.

# No seed: different results every time
print(np.random.rand(3))  # Different each time

# Set a seed: same results every time
np.random.seed(42)
print(np.random.rand(3))  # [0.374... 0.950... 0.731...]

np.random.seed(42)        # Reset the same seed
print(np.random.rand(3))  # [0.374... 0.950... 0.731...]  Exactly the same!

# Seed setting with the new API
rng = np.random.default_rng(seed=42)
print(rng.random(3))

rng2 = np.random.default_rng(seed=42)  # Same seed
print(rng2.random(3))                    # Same result

What a Seed Does

A random seed is like a "recipe for random numbers" — the same seed always produces the same sequence of random numbers. Be sure to set a seed in the following situations:

During learning/tutorials: to make results easier to verify
In scientific experiments: to ensure results are reproducible
When debugging code: to remove interference from randomness
When training machine learning models: to ensure fair comparison experiments

Common Probability Distributions

Uniform Distribution

Each value has the same probability of appearing:

rng = np.random.default_rng(42)

# Uniform distribution between [0, 1)
uniform_01 = rng.random(10000)
print(f"Mean: {uniform_01.mean():.4f}")  # ≈ 0.5
print(f"Min: {uniform_01.min():.4f}")   # ≈ 0
print(f"Max: {uniform_01.max():.4f}")   # ≈ 1

# Uniform distribution between [low, high)
uniform_custom = rng.uniform(low=10, high=50, size=1000)
print(f"Mean: {uniform_custom.mean():.1f}")  # ≈ 30

Normal Distribution (Gaussian Distribution)

This is one of the most important distributions — it appears everywhere in nature and data:

rng = np.random.default_rng(42)

# Standard normal distribution: mean = 0, standard deviation = 1
standard = rng.standard_normal(10000)
print(f"Mean: {standard.mean():.4f}")  # ≈ 0
print(f"Standard deviation: {standard.std():.4f}")  # ≈ 1

# Normal distribution with a specified mean and standard deviation
# For example: the height of adult men in China is about 170 cm, with a standard deviation of about 6 cm
heights = rng.normal(loc=170, scale=6, size=10000)
print(f"Average height: {heights.mean():.1f} cm")
print(f"Standard deviation: {heights.std():.1f} cm")
print(f"Shortest: {heights.min():.1f} cm")
print(f"Tallest: {heights.max():.1f} cm")

Binomial Distribution

The number of successes in n independent trials (for example, flipping a coin):

rng = np.random.default_rng(42)

# Simulate flipping a coin 10 times (probability of heads = 0.5), repeated 10000 times
results = rng.binomial(n=10, p=0.5, size=10000)
print(f"Average number of heads: {results.mean():.2f}")  # ≈ 5
print(f"Minimum: {results.min()}")
print(f"Maximum: {results.max()}")

Other Common Distributions

rng = np.random.default_rng(42)

# Poisson distribution (number of events)
# For example: 5 customers arrive per hour on average
visitors = rng.poisson(lam=5, size=1000)
print(f"Poisson distribution - Mean: {visitors.mean():.2f}")

# Exponential distribution (time between events)
wait_times = rng.exponential(scale=2.0, size=1000)
print(f"Exponential distribution - Mean: {wait_times.mean():.2f}")

# choice: randomly select from an array
names = np.array(["Alice", "Bob", "Charlie", "Diana", "Eve"])
chosen = rng.choice(names, size=3, replace=False)  # Sampling without replacement
print(f"Random selection: {chosen}")

Random Operations

Random Shuffling

rng = np.random.default_rng(42)

arr = np.arange(10)    # [0 1 2 3 4 5 6 7 8 9]

# Shuffle in place
rng.shuffle(arr)
print(arr)              # [8 1 5 0 7 2 9 4 3 6] (random order)

# Shuffle and return a new array (do not modify the original)
arr2 = np.arange(10)
shuffled = rng.permutation(arr2)
print(arr2)       # [0 1 2 3 4 5 6 7 8 9]  original array unchanged
print(shuffled)   # New shuffled array

Random Sampling

rng = np.random.default_rng(42)

data = np.arange(100)

# Sampling with replacement (duplicates possible)
sample1 = rng.choice(data, size=10, replace=True)
print(f"With replacement: {sample1}")

# Sampling without replacement (no duplicates)
sample2 = rng.choice(data, size=10, replace=False)
print(f"Without replacement: {sample2}")

# Weighted random sampling
items = np.array(["Common", "Uncommon", "Rare", "Legendary"])
weights = np.array([0.6, 0.25, 0.1, 0.05])  # probabilities
drops = rng.choice(items, size=20, p=weights)
unique, counts = np.unique(drops, return_counts=True)
for item, count in zip(unique, counts):
    print(f"  {item}: {count} times")

Statistical Operations

NumPy provides a rich set of statistical functions:

Descriptive Statistics

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=75, scale=10, size=100)  # Scores of 100 students

print("=== Descriptive Statistics ===")
print(f"Mean (mean):     {np.mean(data):.2f}")
print(f"Median (median): {np.median(data):.2f}")
print(f"Standard deviation (std):    {np.std(data):.2f}")
print(f"Variance (var):      {np.var(data):.2f}")
print(f"Minimum (min):    {np.min(data):.2f}")
print(f"Maximum (max):    {np.max(data):.2f}")
print(f"Range (ptp):      {np.ptp(data):.2f}")   # max - min

Percentiles

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=75, scale=10, size=1000)

# Percentiles
print(f"25th percentile: {np.percentile(data, 25):.2f}")
print(f"50th percentile: {np.percentile(data, 50):.2f}")  # = median
print(f"75th percentile: {np.percentile(data, 75):.2f}")
print(f"90th percentile: {np.percentile(data, 90):.2f}")

# Interquartile range (IQR)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print(f"Interquartile range (IQR): {iqr:.2f}")

Correlation Coefficient

rng = np.random.default_rng(seed=42)

# Height and weight are usually positively correlated
height = rng.normal(170, 8, 100)
weight = height * 0.6 - 30 + rng.normal(0, 5, 100)  # Approximate linear relationship + noise

# Compute the correlation coefficient matrix
corr_matrix = np.corrcoef(height, weight)
print(f"Correlation coefficient: {corr_matrix[0, 1]:.4f}")  # ≈ 0.7~0.9 (positive correlation)

# Interpretation:
# 1.0  = perfect positive correlation
# 0.0  = no correlation
# -1.0 = perfect negative correlation

Histogram Statistics

rng = np.random.default_rng(seed=42)
scores = rng.normal(75, 10, 200)

# Count how many scores fall into each range
bins = [0, 60, 70, 80, 90, 100]
counts, bin_edges = np.histogram(scores, bins=bins)
labels = ["Failing", "Passing", "Average", "Good", "Excellent"]

print("=== Score Distribution ===")
for label, count, left, right in zip(labels, counts, bin_edges[:-1], bin_edges[1:]):
    bar = "█" * count
    print(f"  {label} [{left:.0f}-{right:.0f}): {count:3d} {bar}")

Hands-On: Simulating Monte Carlo

The Monte Carlo method is a classic way to use random numbers to estimate complex problems. Below, we use it to estimate π:

import numpy as np

def estimate_pi(n_points):
    """
    Estimate π by randomly scattering points inside a square
    The proportion of points that fall inside the quarter circle is approximately π/4
    """
    rng = np.random.default_rng(42)

    # Randomly scatter points in the square [0, 1] × [0, 1]
    x = rng.random(n_points)
    y = rng.random(n_points)

    # Compute the distance to the origin
    distance = np.sqrt(x**2 + y**2)

    # Number of points inside the quarter circle (distance <= 1)
    inside = np.sum(distance <= 1)

    # π ≈ 4 × (number of points inside the circle / total number of points)
    pi_estimate = 4 * inside / n_points
    return pi_estimate

# Estimation accuracy with different numbers of points
for n in [100, 1000, 10000, 100000, 1000000]:
    pi_est = estimate_pi(n)
    error = abs(pi_est - np.pi)
    print(f"  {n:>10,} points → π ≈ {pi_est:.6f}  Error: {error:.6f}")

Output:

       100 points → π ≈ 3.120000  Error: 0.021593
     1,000 points → π ≈ 3.156000  Error: 0.014407
    10,000 points → π ≈ 3.153200  Error: 0.011607
   100,000 points → π ≈ 3.140480  Error: 0.001113
 1,000,000 points → π ≈ 3.142484  Error: 0.000891

The more points you use, the more accurate the estimate becomes! That is the charm of the Monte Carlo method.

Summary

Hands-On Exercises

Exercise 1: Simulate Rolling Dice

rng = np.random.default_rng(42)

# Simulate rolling 2 dice 10000 times
# 1. Generate a 10000×2 array of random integers (each row is one roll of two dice)
# 2. Compute the sum of the two dice for each roll
# 3. Count how many times each sum (2~12) appears
# 4. Find the most frequent sum (should be 7)

Exercise 2: Simulate Stock Prices

rng = np.random.default_rng(42)

# Simulate price changes for one stock over 250 trading days
# Initial price: 100
# Daily returns follow a normal distribution: mean 0.05%, standard deviation 2%
initial_price = 100
n_days = 250

# 1. Generate 250 daily returns
# daily_returns = rng.normal(loc=?, scale=?, size=?)

# 2. Compute the price for each day (hint: use np.cumprod)
# prices = initial_price * np.cumprod(1 + daily_returns)

# 3. Compute the final price, highest price, and lowest price
# 4. Compute the annualized return

Exercise 3: Score Analysis

rng = np.random.default_rng(seed=42)

# Generate scores for 200 students
math_scores = rng.normal(75, 12, 200).clip(0, 100)    # Math
english_scores = rng.normal(78, 10, 200).clip(0, 100)  # English

# 1. Compute the mean, standard deviation, and median for each subject
# 2. Compute the correlation coefficient between the two subjects
# 3. Count how many students failed math but passed English
# 4. Use histogram to analyze the score distribution for both subjects
# 5. Compute the average score of the Top 10 students by total score

Chapter Summary: A Complete View of NumPy Knowledge

Congratulations on finishing all the NumPy content! Let's review what you learned in this chapter:

✅ Self-check: Can you use NumPy to create a 100×3 random matrix, compute the mean and standard deviation of each column, and find the column index of the maximum value in each row?

import numpy as np

rng = np.random.default_rng(42)
matrix = rng.normal(loc=50, scale=15, size=(100, 3))

# Column means
print("Column means:", np.mean(matrix, axis=0))

# Column standard deviations
print("Column standard deviations:", np.std(matrix, axis=0))

# Column index of the maximum value in each row
print("Column indices of row maxima:", np.argmax(matrix, axis=1))

If all of this feels easy — congratulations, you are ready to step into the world of Pandas!

Learning Objectives​

Why Do We Need Random Numbers?​

numpy.random Basics​

New API (Recommended)​

Old API (Still Commonly Used)​

Random Seed: Making "Random" Reproducible​

Common Probability Distributions​

Uniform Distribution​

Normal Distribution (Gaussian Distribution)​

Binomial Distribution​

Other Common Distributions​

Random Operations​

Random Shuffling​

Random Sampling​

Statistical Operations​

Descriptive Statistics​

Percentiles​

Correlation Coefficient​

Histogram Statistics​

Hands-On: Simulating Monte Carlo​

Summary​

Hands-On Exercises​

Exercise 1: Simulate Rolling Dice​

Exercise 2: Simulate Stock Prices​

Exercise 3: Score Analysis​

Chapter Summary: A Complete View of NumPy Knowledge​