Skip to content

3.2.7 Random Numbers and Statistics

NumPy Random Numbers and Statistics Map

  • Master the common functions in the numpy.random module
  • Understand common probability distributions (uniform, normal, binomial)
  • Understand the role of the random seed
  • Learn to use NumPy for basic statistical operations

In data science and AI, random numbers are everywhere:

ScenarioWhy random numbers are needed
Dataset splittingRandomly split the training set and test set
Model initializationNeural network weights need random initialization
Data augmentationRandomly crop, rotate, and flip images
Monte Carlo simulationUse random sampling to estimate complex problems
A/B testingRandomly assign users to the control group and the experiment group

NumPy recommends using the newer Generator API:

import numpy as np
# Create a random number generator
rng = np.random.default_rng(seed=42)
# Uniform random numbers in [0, 1)
print(rng.random(5))
# [0.773... 0.438... 0.858... 0.697... 0.094...]
# Random integers in a specified range
print(rng.integers(1, 100, size=5))
# [67 82 42 91 23] (example values)
# Normal random numbers
print(rng.standard_normal(5))
# [-0.15... 0.74... -0.27... ...]

You will still see the old API in many tutorials and code examples, so you need to recognize it too:

# Old-style usage (still valid)
np.random.seed(42) # Set the global seed
# Uniform random numbers in [0, 1)
print(np.random.rand(3))
# Standard normal distribution
print(np.random.randn(3))
# Random integers
print(np.random.randint(1, 100, size=5))

Random Seed: Making “Random” Reproducible

Section titled “Random Seed: Making “Random” Reproducible”

In scientific research and debugging, we often need “reproducible randomness” — getting the same results every time we run the code.

# No seed: different results every time
print(np.random.rand(3)) # Different each time
# Set a seed: same results every time
np.random.seed(42)
print(np.random.rand(3)) # [0.374... 0.950... 0.731...]
np.random.seed(42) # Reset the same seed
print(np.random.rand(3)) # [0.374... 0.950... 0.731...] Exactly the same!
# Seed setting with the new API
rng = np.random.default_rng(seed=42)
print(rng.random(3))
rng2 = np.random.default_rng(seed=42) # Same seed
print(rng2.random(3)) # Same result

Keep this page’s proof of learning as a small evidence card:

Array State
shape, dtype, axis, and sample values before the operation
Operation
indexing, slicing, broadcasting, reshape, linear algebra, or random/stat function
Output
resulting array shape, values, or statistic
Failure Check
axis confusion, view/copy trap, broadcast mismatch, or wrong shape
Expected Output
printed shapes and values that make the array operation inspectable

Each value has the same probability of appearing:

rng = np.random.default_rng(42)
# Uniform distribution between [0, 1)
uniform_01 = rng.random(10000)
print(f"Mean: {uniform_01.mean():.4f}") # ≈ 0.5
print(f"Min: {uniform_01.min():.4f}") # ≈ 0
print(f"Max: {uniform_01.max():.4f}") # ≈ 1
# Uniform distribution between [low, high)
uniform_custom = rng.uniform(low=10, high=50, size=1000)
print(f"Mean: {uniform_custom.mean():.1f}") # ≈ 30

Normal Distribution (Gaussian Distribution)

Section titled “Normal Distribution (Gaussian Distribution)”

This is one of the most important distributions — it appears everywhere in nature and data:

rng = np.random.default_rng(42)
# Standard normal distribution: mean = 0, standard deviation = 1
standard = rng.standard_normal(10000)
print(f"Mean: {standard.mean():.4f}") # ≈ 0
print(f"Standard deviation: {standard.std():.4f}") # ≈ 1
# Normal distribution with a specified mean and standard deviation
# For example: the height of adult men in China is about 170 cm, with a standard deviation of about 6 cm
heights = rng.normal(loc=170, scale=6, size=10000)
print(f"Average height: {heights.mean():.1f} cm")
print(f"Standard deviation: {heights.std():.1f} cm")
print(f"Shortest: {heights.min():.1f} cm")
print(f"Tallest: {heights.max():.1f} cm")

The number of successes in n independent trials (for example, flipping a coin):

rng = np.random.default_rng(42)
# Simulate flipping a coin 10 times (probability of heads = 0.5), repeated 10000 times
results = rng.binomial(n=10, p=0.5, size=10000)
print(f"Average number of heads: {results.mean():.2f}") # ≈ 5
print(f"Minimum: {results.min()}")
print(f"Maximum: {results.max()}")
rng = np.random.default_rng(42)
# Poisson distribution (number of events)
# For example: 5 customers arrive per hour on average
visitors = rng.poisson(lam=5, size=1000)
print(f"Poisson distribution - Mean: {visitors.mean():.2f}")
# Exponential distribution (time between events)
wait_times = rng.exponential(scale=2.0, size=1000)
print(f"Exponential distribution - Mean: {wait_times.mean():.2f}")
# choice: randomly select from an array
names = np.array(["Alice", "Bob", "Charlie", "Diana", "Eve"])
chosen = rng.choice(names, size=3, replace=False) # Sampling without replacement
print(f"Random selection: {chosen}")

rng = np.random.default_rng(42)
arr = np.arange(10) # [0 1 2 3 4 5 6 7 8 9]
# Shuffle in place
rng.shuffle(arr)
print(arr) # [8 1 5 0 7 2 9 4 3 6] (random order)
# Shuffle and return a new array (do not modify the original)
arr2 = np.arange(10)
shuffled = rng.permutation(arr2)
print(arr2) # [0 1 2 3 4 5 6 7 8 9] original array unchanged
print(shuffled) # New shuffled array
rng = np.random.default_rng(42)
data = np.arange(100)
# Sampling with replacement (duplicates possible)
sample1 = rng.choice(data, size=10, replace=True)
print(f"With replacement: {sample1}")
# Sampling without replacement (no duplicates)
sample2 = rng.choice(data, size=10, replace=False)
print(f"Without replacement: {sample2}")
# Weighted random sampling
items = np.array(["Common", "Uncommon", "Rare", "Legendary"])
weights = np.array([0.6, 0.25, 0.1, 0.05]) # probabilities
drops = rng.choice(items, size=20, p=weights)
unique, counts = np.unique(drops, return_counts=True)
for item, count in zip(unique, counts):
print(f" {item}: {count} times")

NumPy provides a rich set of statistical functions:

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=75, scale=10, size=100) # Scores of 100 students
print("=== Descriptive Statistics ===")
print(f"Mean (mean): {np.mean(data):.2f}")
print(f"Median (median): {np.median(data):.2f}")
print(f"Standard deviation (std): {np.std(data):.2f}")
print(f"Variance (var): {np.var(data):.2f}")
print(f"Minimum (min): {np.min(data):.2f}")
print(f"Maximum (max): {np.max(data):.2f}")
print(f"Range (ptp): {np.ptp(data):.2f}") # max - min
rng = np.random.default_rng(seed=42)
data = rng.normal(loc=75, scale=10, size=1000)
# Percentiles
print(f"25th percentile: {np.percentile(data, 25):.2f}")
print(f"50th percentile: {np.percentile(data, 50):.2f}") # = median
print(f"75th percentile: {np.percentile(data, 75):.2f}")
print(f"90th percentile: {np.percentile(data, 90):.2f}")
# Interquartile range (IQR)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print(f"Interquartile range (IQR): {iqr:.2f}")
rng = np.random.default_rng(seed=42)
# Height and weight are usually positively correlated
height = rng.normal(170, 8, 100)
weight = height * 0.6 - 30 + rng.normal(0, 5, 100) # Approximate linear relationship + noise
# Compute the correlation coefficient matrix
corr_matrix = np.corrcoef(height, weight)
print(f"Correlation coefficient: {corr_matrix[0, 1]:.4f}") # ≈ 0.7~0.9 (positive correlation)
# Interpretation:
# 1.0 = perfect positive correlation
# 0.0 = no correlation
# -1.0 = perfect negative correlation
rng = np.random.default_rng(seed=42)
scores = rng.normal(75, 10, 200)
# Count how many scores fall into each range
bins = [0, 60, 70, 80, 90, 100]
counts, bin_edges = np.histogram(scores, bins=bins)
labels = ["Failing", "Passing", "Average", "Good", "Excellent"]
print("=== Score Distribution ===")
for label, count, left, right in zip(labels, counts, bin_edges[:-1], bin_edges[1:]):
bar = "" * count
print(f" {label} [{left:.0f}-{right:.0f}): {count:3d} {bar}")

The Monte Carlo method is a classic way to use random numbers to estimate complex problems. Below, we use it to estimate π:

import numpy as np
def estimate_pi(n_points):
"""
Estimate π by randomly scattering points inside a square
The proportion of points that fall inside the quarter circle is approximately π/4
"""
rng = np.random.default_rng(42)
# Randomly scatter points in the square [0, 1] × [0, 1]
x = rng.random(n_points)
y = rng.random(n_points)
# Compute the distance to the origin
distance = np.sqrt(x**2 + y**2)
# Number of points inside the quarter circle (distance <= 1)
inside = np.sum(distance <= 1)
# π ≈ 4 × (number of points inside the circle / total number of points)
pi_estimate = 4 * inside / n_points
return pi_estimate
# Estimation accuracy with different numbers of points
for n in [100, 1000, 10000, 100000, 1000000]:
pi_est = estimate_pi(n)
error = abs(pi_est - np.pi)
print(f" {n:>10,} points → π ≈ {pi_est:.6f} Error: {error:.6f}")

Output summary:

PointsEstimated πError
1003.1200000.021593
1,0003.1560000.014407
10,0003.1532000.011607
100,0003.1404800.001113
1,000,0003.1424840.000891

The more points you use, the more accurate the estimate becomes! That is the charm of the Monte Carlo method.


root((Random Numbers and Statistics))
Random Number Generation
random / rand Uniform distribution
normal / randn Normal distribution
integers / randint Random integers
choice Random selection
shuffle / permutation Shuffle
Random Seed
seed Ensures reproducibility
default_rng New API
Probability Distributions
Uniform distribution uniform
Normal distribution normal
Binomial distribution binomial
Poisson distribution poisson
Statistical Functions
mean / median / std / var
min / max / argmin / argmax
percentile Percentile
corrcoef Correlation coefficient
histogram Histogram

rng = np.random.default_rng(42)
# Simulate rolling 2 dice 10000 times
# 1. Generate a 10000×2 array of random integers (each row is one roll of two dice)
# 2. Compute the sum of the two dice for each roll
# 3. Count how many times each sum (2~12) appears
# 4. Find the most frequent sum (should be 7)
rng = np.random.default_rng(42)
# Simulate price changes for one stock over 250 trading days
# Initial price: 100
# Daily returns follow a normal distribution: mean 0.05%, standard deviation 2%
initial_price = 100
n_days = 250
# 1. Generate 250 daily returns
# daily_returns = rng.normal(loc=?, scale=?, size=?)
# 2. Compute the price for each day (hint: use np.cumprod)
# prices = initial_price * np.cumprod(1 + daily_returns)
# 3. Compute the final price, highest price, and lowest price
# 4. Compute the annualized return
rng = np.random.default_rng(seed=42)
# Generate metrics for 200 inference batches
accuracy = rng.normal(0.78, 0.08, 200).clip(0, 1)
latency_ms = rng.normal(180, 35, 200).clip(40, None)
# 1. Compute the mean, standard deviation, and median for each metric
# 2. Compute the correlation coefficient between accuracy and latency
# 3. Count how many batches have accuracy < 0.7 but latency < 160 ms
# 4. Use histograms to analyze both distributions
# 5. Compute the average latency of the top 10 batches by accuracy

Reference implementation and walkthrough
  • For dice simulation, create a (10000, 2) array, sum along axis=1, and count sums with np.bincount. The most frequent sum should usually be 7 because it has the most combinations.
  • For stock simulation, generate daily returns, then compute prices with 100 * np.cumprod(1 + returns). Report the final return, maximum drawdown if asked, and a chart so the path is visible.
  • For batch metrics, use mean, std, corrcoef, boolean filters, histograms, and top-k sorting. Always describe the random seed so someone else can reproduce the same sample.

Chapter Summary: A Complete View of NumPy Knowledge

Section titled “Chapter Summary: A Complete View of NumPy Knowledge”

Congratulations on finishing all the NumPy content! Let’s review what you learned in this chapter:

flowchart TB
subgraph "Chapter 1 NumPy Scientific Computing"
A["1.1 Overview<br/>Why use NumPy"] --> B["1.2 Array Basics<br/>Creation + Attributes + dtype"]
B --> C["1.3 Indexing and Slicing<br/>Basic / Boolean / Fancy Indexing"]
C --> D["1.4 Array Operations<br/>Vectorization + Broadcasting + Aggregation"]
D --> E["1.5 Reshaping Operations<br/>reshape + Concatenation + Splitting"]
E --> F["1.6 Linear Algebra<br/>Matrix Multiplication + Inverse + Eigenvalues"]
F --> G["1.7 Random Numbers and Statistics<br/>Distributions + Sampling + Statistics"]
end
G --> H["✅ You now master the core of NumPy!<br/>Ready to move on to Pandas →"]
style H fill:#4caf50,color:#fff

✅ Self-check: Can you use NumPy to create a 100×3 random matrix, compute the mean and standard deviation of each column, and find the column index of the maximum value in each row?

import numpy as np
rng = np.random.default_rng(42)
matrix = rng.normal(loc=50, scale=15, size=(100, 3))
# Column means
print("Column means:", np.mean(matrix, axis=0))
# Column standard deviations
print("Column standard deviations:", np.std(matrix, axis=0))
# Column index of the maximum value in each row
print("Column indices of row maxima:", np.argmax(matrix, axis=1))

If all of this feels easy — congratulations, you are ready to step into the world of Pandas!