3.2.7 Random Numbers and Statistics

Learning Objectives
Section titled “Learning Objectives”- Master the common functions in the
numpy.randommodule - Understand common probability distributions (uniform, normal, binomial)
- Understand the role of the random seed
- Learn to use NumPy for basic statistical operations
Why Do We Need Random Numbers?
Section titled “Why Do We Need Random Numbers?”In data science and AI, random numbers are everywhere:
| Scenario | Why random numbers are needed |
|---|---|
| Dataset splitting | Randomly split the training set and test set |
| Model initialization | Neural network weights need random initialization |
| Data augmentation | Randomly crop, rotate, and flip images |
| Monte Carlo simulation | Use random sampling to estimate complex problems |
| A/B testing | Randomly assign users to the control group and the experiment group |
numpy.random Basics
Section titled “numpy.random Basics”New API (Recommended)
Section titled “New API (Recommended)”NumPy recommends using the newer Generator API:
import numpy as np
# Create a random number generatorrng = np.random.default_rng(seed=42)
# Uniform random numbers in [0, 1)print(rng.random(5))# [0.773... 0.438... 0.858... 0.697... 0.094...]
# Random integers in a specified rangeprint(rng.integers(1, 100, size=5))# [67 82 42 91 23] (example values)
# Normal random numbersprint(rng.standard_normal(5))# [-0.15... 0.74... -0.27... ...]Old API (Still Commonly Used)
Section titled “Old API (Still Commonly Used)”You will still see the old API in many tutorials and code examples, so you need to recognize it too:
# Old-style usage (still valid)np.random.seed(42) # Set the global seed
# Uniform random numbers in [0, 1)print(np.random.rand(3))
# Standard normal distributionprint(np.random.randn(3))
# Random integersprint(np.random.randint(1, 100, size=5))Random Seed: Making “Random” Reproducible
Section titled “Random Seed: Making “Random” Reproducible”In scientific research and debugging, we often need “reproducible randomness” — getting the same results every time we run the code.
# No seed: different results every timeprint(np.random.rand(3)) # Different each time
# Set a seed: same results every timenp.random.seed(42)print(np.random.rand(3)) # [0.374... 0.950... 0.731...]
np.random.seed(42) # Reset the same seedprint(np.random.rand(3)) # [0.374... 0.950... 0.731...] Exactly the same!# Seed setting with the new APIrng = np.random.default_rng(seed=42)print(rng.random(3))
rng2 = np.random.default_rng(seed=42) # Same seedprint(rng2.random(3)) # Same resultEvidence to Keep
Section titled “Evidence to Keep”Keep this page’s proof of learning as a small evidence card:
- Array State
- shape, dtype, axis, and sample values before the operation
- Operation
- indexing, slicing, broadcasting, reshape, linear algebra, or random/stat function
- Output
- resulting array shape, values, or statistic
- Failure Check
- axis confusion, view/copy trap, broadcast mismatch, or wrong shape
- Expected Output
- printed shapes and values that make the array operation inspectable
Common Probability Distributions
Section titled “Common Probability Distributions”Uniform Distribution
Section titled “Uniform Distribution”Each value has the same probability of appearing:
rng = np.random.default_rng(42)
# Uniform distribution between [0, 1)uniform_01 = rng.random(10000)print(f"Mean: {uniform_01.mean():.4f}") # ≈ 0.5print(f"Min: {uniform_01.min():.4f}") # ≈ 0print(f"Max: {uniform_01.max():.4f}") # ≈ 1
# Uniform distribution between [low, high)uniform_custom = rng.uniform(low=10, high=50, size=1000)print(f"Mean: {uniform_custom.mean():.1f}") # ≈ 30Normal Distribution (Gaussian Distribution)
Section titled “Normal Distribution (Gaussian Distribution)”This is one of the most important distributions — it appears everywhere in nature and data:
rng = np.random.default_rng(42)
# Standard normal distribution: mean = 0, standard deviation = 1standard = rng.standard_normal(10000)print(f"Mean: {standard.mean():.4f}") # ≈ 0print(f"Standard deviation: {standard.std():.4f}") # ≈ 1
# Normal distribution with a specified mean and standard deviation# For example: the height of adult men in China is about 170 cm, with a standard deviation of about 6 cmheights = rng.normal(loc=170, scale=6, size=10000)print(f"Average height: {heights.mean():.1f} cm")print(f"Standard deviation: {heights.std():.1f} cm")print(f"Shortest: {heights.min():.1f} cm")print(f"Tallest: {heights.max():.1f} cm")Binomial Distribution
Section titled “Binomial Distribution”The number of successes in n independent trials (for example, flipping a coin):
rng = np.random.default_rng(42)
# Simulate flipping a coin 10 times (probability of heads = 0.5), repeated 10000 timesresults = rng.binomial(n=10, p=0.5, size=10000)print(f"Average number of heads: {results.mean():.2f}") # ≈ 5print(f"Minimum: {results.min()}")print(f"Maximum: {results.max()}")Other Common Distributions
Section titled “Other Common Distributions”rng = np.random.default_rng(42)
# Poisson distribution (number of events)# For example: 5 customers arrive per hour on averagevisitors = rng.poisson(lam=5, size=1000)print(f"Poisson distribution - Mean: {visitors.mean():.2f}")
# Exponential distribution (time between events)wait_times = rng.exponential(scale=2.0, size=1000)print(f"Exponential distribution - Mean: {wait_times.mean():.2f}")
# choice: randomly select from an arraynames = np.array(["Alice", "Bob", "Charlie", "Diana", "Eve"])chosen = rng.choice(names, size=3, replace=False) # Sampling without replacementprint(f"Random selection: {chosen}")Random Operations
Section titled “Random Operations”Random Shuffling
Section titled “Random Shuffling”rng = np.random.default_rng(42)
arr = np.arange(10) # [0 1 2 3 4 5 6 7 8 9]
# Shuffle in placerng.shuffle(arr)print(arr) # [8 1 5 0 7 2 9 4 3 6] (random order)
# Shuffle and return a new array (do not modify the original)arr2 = np.arange(10)shuffled = rng.permutation(arr2)print(arr2) # [0 1 2 3 4 5 6 7 8 9] original array unchangedprint(shuffled) # New shuffled arrayRandom Sampling
Section titled “Random Sampling”rng = np.random.default_rng(42)
data = np.arange(100)
# Sampling with replacement (duplicates possible)sample1 = rng.choice(data, size=10, replace=True)print(f"With replacement: {sample1}")
# Sampling without replacement (no duplicates)sample2 = rng.choice(data, size=10, replace=False)print(f"Without replacement: {sample2}")
# Weighted random samplingitems = np.array(["Common", "Uncommon", "Rare", "Legendary"])weights = np.array([0.6, 0.25, 0.1, 0.05]) # probabilitiesdrops = rng.choice(items, size=20, p=weights)unique, counts = np.unique(drops, return_counts=True)for item, count in zip(unique, counts): print(f" {item}: {count} times")Statistical Operations
Section titled “Statistical Operations”NumPy provides a rich set of statistical functions:
Descriptive Statistics
Section titled “Descriptive Statistics”rng = np.random.default_rng(seed=42)data = rng.normal(loc=75, scale=10, size=100) # Scores of 100 students
print("=== Descriptive Statistics ===")print(f"Mean (mean): {np.mean(data):.2f}")print(f"Median (median): {np.median(data):.2f}")print(f"Standard deviation (std): {np.std(data):.2f}")print(f"Variance (var): {np.var(data):.2f}")print(f"Minimum (min): {np.min(data):.2f}")print(f"Maximum (max): {np.max(data):.2f}")print(f"Range (ptp): {np.ptp(data):.2f}") # max - minPercentiles
Section titled “Percentiles”rng = np.random.default_rng(seed=42)data = rng.normal(loc=75, scale=10, size=1000)
# Percentilesprint(f"25th percentile: {np.percentile(data, 25):.2f}")print(f"50th percentile: {np.percentile(data, 50):.2f}") # = medianprint(f"75th percentile: {np.percentile(data, 75):.2f}")print(f"90th percentile: {np.percentile(data, 90):.2f}")
# Interquartile range (IQR)q1 = np.percentile(data, 25)q3 = np.percentile(data, 75)iqr = q3 - q1print(f"Interquartile range (IQR): {iqr:.2f}")Correlation Coefficient
Section titled “Correlation Coefficient”rng = np.random.default_rng(seed=42)
# Height and weight are usually positively correlatedheight = rng.normal(170, 8, 100)weight = height * 0.6 - 30 + rng.normal(0, 5, 100) # Approximate linear relationship + noise
# Compute the correlation coefficient matrixcorr_matrix = np.corrcoef(height, weight)print(f"Correlation coefficient: {corr_matrix[0, 1]:.4f}") # ≈ 0.7~0.9 (positive correlation)
# Interpretation:# 1.0 = perfect positive correlation# 0.0 = no correlation# -1.0 = perfect negative correlationHistogram Statistics
Section titled “Histogram Statistics”rng = np.random.default_rng(seed=42)scores = rng.normal(75, 10, 200)
# Count how many scores fall into each rangebins = [0, 60, 70, 80, 90, 100]counts, bin_edges = np.histogram(scores, bins=bins)labels = ["Failing", "Passing", "Average", "Good", "Excellent"]
print("=== Score Distribution ===")for label, count, left, right in zip(labels, counts, bin_edges[:-1], bin_edges[1:]): bar = "█" * count print(f" {label} [{left:.0f}-{right:.0f}): {count:3d} {bar}")Hands-On: Simulating Monte Carlo
Section titled “Hands-On: Simulating Monte Carlo”The Monte Carlo method is a classic way to use random numbers to estimate complex problems. Below, we use it to estimate π:
import numpy as np
def estimate_pi(n_points): """ Estimate π by randomly scattering points inside a square The proportion of points that fall inside the quarter circle is approximately π/4 """ rng = np.random.default_rng(42)
# Randomly scatter points in the square [0, 1] × [0, 1] x = rng.random(n_points) y = rng.random(n_points)
# Compute the distance to the origin distance = np.sqrt(x**2 + y**2)
# Number of points inside the quarter circle (distance <= 1) inside = np.sum(distance <= 1)
# π ≈ 4 × (number of points inside the circle / total number of points) pi_estimate = 4 * inside / n_points return pi_estimate
# Estimation accuracy with different numbers of pointsfor n in [100, 1000, 10000, 100000, 1000000]: pi_est = estimate_pi(n) error = abs(pi_est - np.pi) print(f" {n:>10,} points → π ≈ {pi_est:.6f} Error: {error:.6f}")Output summary:
| Points | Estimated π | Error |
|---|---|---|
| 100 | 3.120000 | 0.021593 |
| 1,000 | 3.156000 | 0.014407 |
| 10,000 | 3.153200 | 0.011607 |
| 100,000 | 3.140480 | 0.001113 |
| 1,000,000 | 3.142484 | 0.000891 |
The more points you use, the more accurate the estimate becomes! That is the charm of the Monte Carlo method.
Summary
Section titled “Summary” root((Random Numbers and Statistics)) Random Number Generation random / rand Uniform distribution normal / randn Normal distribution integers / randint Random integers choice Random selection shuffle / permutation Shuffle Random Seed seed Ensures reproducibility default_rng New API Probability Distributions Uniform distribution uniform Normal distribution normal Binomial distribution binomial Poisson distribution poisson Statistical Functions mean / median / std / var min / max / argmin / argmax percentile Percentile corrcoef Correlation coefficient histogram HistogramHands-On Exercises
Section titled “Hands-On Exercises”Exercise 1: Simulate Rolling Dice
Section titled “Exercise 1: Simulate Rolling Dice”rng = np.random.default_rng(42)
# Simulate rolling 2 dice 10000 times# 1. Generate a 10000×2 array of random integers (each row is one roll of two dice)# 2. Compute the sum of the two dice for each roll# 3. Count how many times each sum (2~12) appears# 4. Find the most frequent sum (should be 7)Exercise 2: Simulate Stock Prices
Section titled “Exercise 2: Simulate Stock Prices”rng = np.random.default_rng(42)
# Simulate price changes for one stock over 250 trading days# Initial price: 100# Daily returns follow a normal distribution: mean 0.05%, standard deviation 2%initial_price = 100n_days = 250
# 1. Generate 250 daily returns# daily_returns = rng.normal(loc=?, scale=?, size=?)
# 2. Compute the price for each day (hint: use np.cumprod)# prices = initial_price * np.cumprod(1 + daily_returns)
# 3. Compute the final price, highest price, and lowest price# 4. Compute the annualized returnExercise 3: Batch Metrics Analysis
Section titled “Exercise 3: Batch Metrics Analysis”rng = np.random.default_rng(seed=42)
# Generate metrics for 200 inference batchesaccuracy = rng.normal(0.78, 0.08, 200).clip(0, 1)latency_ms = rng.normal(180, 35, 200).clip(40, None)
# 1. Compute the mean, standard deviation, and median for each metric# 2. Compute the correlation coefficient between accuracy and latency# 3. Count how many batches have accuracy < 0.7 but latency < 160 ms# 4. Use histograms to analyze both distributions# 5. Compute the average latency of the top 10 batches by accuracyReference implementation and walkthrough
- For dice simulation, create a
(10000, 2)array, sum alongaxis=1, and count sums withnp.bincount. The most frequent sum should usually be7because it has the most combinations. - For stock simulation, generate daily returns, then compute prices with
100 * np.cumprod(1 + returns). Report the final return, maximum drawdown if asked, and a chart so the path is visible. - For batch metrics, use
mean,std,corrcoef, boolean filters, histograms, and top-k sorting. Always describe the random seed so someone else can reproduce the same sample.
Chapter Summary: A Complete View of NumPy Knowledge
Section titled “Chapter Summary: A Complete View of NumPy Knowledge”Congratulations on finishing all the NumPy content! Let’s review what you learned in this chapter:
flowchart TB subgraph "Chapter 1 NumPy Scientific Computing" A["1.1 Overview<br/>Why use NumPy"] --> B["1.2 Array Basics<br/>Creation + Attributes + dtype"] B --> C["1.3 Indexing and Slicing<br/>Basic / Boolean / Fancy Indexing"] C --> D["1.4 Array Operations<br/>Vectorization + Broadcasting + Aggregation"] D --> E["1.5 Reshaping Operations<br/>reshape + Concatenation + Splitting"] E --> F["1.6 Linear Algebra<br/>Matrix Multiplication + Inverse + Eigenvalues"] F --> G["1.7 Random Numbers and Statistics<br/>Distributions + Sampling + Statistics"] end
G --> H["✅ You now master the core of NumPy!<br/>Ready to move on to Pandas →"]
style H fill:#4caf50,color:#fff✅ Self-check: Can you use NumPy to create a 100×3 random matrix, compute the mean and standard deviation of each column, and find the column index of the maximum value in each row?
import numpy as np
rng = np.random.default_rng(42)matrix = rng.normal(loc=50, scale=15, size=(100, 3))
# Column meansprint("Column means:", np.mean(matrix, axis=0))
# Column standard deviationsprint("Column standard deviations:", np.std(matrix, axis=0))
# Column index of the maximum value in each rowprint("Column indices of row maxima:", np.argmax(matrix, axis=1))If all of this feels easy — congratulations, you are ready to step into the world of Pandas!