4.2.3 Probability Distributions: Patterns Behind the Data

Probability distribution family comparison chart

Learning Objectives

  • Understand what a probability distribution is
  • Master common discrete distributions (Bernoulli, binomial, Poisson)
  • Master common continuous distributions (uniform, normal/Gaussian)
  • Build an intuitive understanding of the Central Limit Theorem — why the normal distribution appears everywhere
  • Use Python to generate and visualize different distributions

Terms to Decode Before Plotting

This lesson introduces several distribution words that look compact but carry a lot of meaning:

| Term | Full name / meaning | Beginner-friendly interpretation |
| --- | --- | --- |
| random variable | A variable whose value is uncertain | The thing we observe, such as clicks, height, dice result, or number of customers |
| PMF | Probability Mass Function | For discrete values, how much probability is assigned to each value |
| PDF | Probability Density Function | For continuous values, where probability density is high or low |
| CDF | Cumulative Distribution Function | Probability that the value is less than or equal to a threshold |
| μ / mu | Mean | The center or average location of a distribution |
| σ / sigma | Standard deviation | How spread out the distribution is |
| λ / lambda_ | Rate or average count | In Poisson, the average number of events in a fixed interval |
| SciPy stats | Statistical functions in SciPy | A Python toolbox for probability distributions, PMF, PDF, and CDF |

If you run this file locally, install the three libraries used by the examples:

python3 -m pip install numpy matplotlib scipy

The examples use lambda_ instead of lambda because lambda is a Python keyword for anonymous functions.
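If you want to see two of these terms in action right away, here is a minimal sketch using SciPy (the values in the comments are approximate; PDF shows up later with the normal distribution):

from scipy import stats

# PMF (discrete): probability that a Binomial(n=10, p=0.5) equals exactly 5
print(stats.binom.pmf(5, 10, 0.5))   # ≈ 0.246

# CDF: probability that the same variable is less than or equal to 5
print(stats.binom.cdf(5, 10, 0.5))   # ≈ 0.623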

First, a very important learning expectation

This section is not meant to turn every distribution into an "exam cheat sheet." Instead, it is meant to help you build one especially important intuition:

  • Probability basics focus on a single event
  • Probability distributions focus on what the whole random phenomenon looks like

Start with a map

If the previous section was about "the probability of a single event," then this section is about:

What does an entire random phenomenon look like?

Map of random phenomena for probability distributions

The key point of this lesson is not to memorize every distribution, but to first know:

  • When a certain distribution appears
  • What it roughly looks like
  • Why you keep running into it in AI

What is a probability distribution?

A probability distribution = all possible values of a random variable and the probability of each value.
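As a tiny illustration (a sketch using a fair die, not one of the lesson's examples), a distribution is nothing more than this pairing:

# A fair die as a probability distribution: every possible value paired with its probability
die = {face: 1/6 for face in range(1, 7)}
print(die[3])             # 0.1666..., the probability of rolling a 3
print(sum(die.values()))  # ≈ 1.0: the probabilities over all outcomes sum to 1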

A more beginner-friendly analogy

If probability is like "whether something will happen this time," then a distribution is more like:

  • A long-term, statistically estimated "map of possibilities"
The examples in the rest of this lesson share the following setup:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Optional: font setup for non-ASCII labels ('Arial Unicode MS' is a macOS font; adjust for your system)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

stats is SciPy's statistics module. In this lesson it saves us from writing the binomial, Poisson, and normal formulas by hand, so we can focus on intuition.


Discrete distributions

Bernoulli distribution — only two outcomes

You perform a single trial, and the result is either "success" (1) or "failure" (0).

# Bernoulli distribution: flipping a coin once
# p = probability of success
p = 0.6 # unfair coin, 60% chance of heads

# Simulate 10000 times
rng = np.random.default_rng(seed=42)
samples = rng.binomial(1, p, 10000)
print(f"Proportion of heads: {samples.mean():.3f}") # ≈ 0.6

fig, ax = plt.subplots(figsize=(6, 4))
values, counts = np.unique(samples, return_counts=True)
ax.bar(['Tails (0)', 'Heads (1)'], counts / len(samples),
       color=['coral', 'steelblue'], edgecolor='white')
ax.set_ylabel('Probability')
ax.set_title(f'Bernoulli Distribution (p={p})')
ax.set_ylim(0, 1)
plt.show()

Expected output with seed=42:

Proportion of heads: 0.605

Application in AI: labels for binary classification tasks follow a Bernoulli distribution (0 or 1).

Binomial distribution — the sum of multiple Bernoulli trials

The total number of successes after n Bernoulli trials follows a binomial distribution.

# Binomial distribution: flipping a coin 20 times, counting heads
n = 20 # number of trials
p = 0.5 # probability of success each time

# Theoretical distribution
x = np.arange(0, n + 1)
pmf = stats.binom.pmf(x, n, p)

# Simulation
rng = np.random.default_rng(seed=42)
samples = rng.binomial(n, p, 10000)
print(f"Expected heads n*p: {n*p:.1f}")
print(f"Most likely number of heads: {x[pmf.argmax()]}")
print(f"Simulated mean: {samples.mean():.3f}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Theory
axes[0].bar(x, pmf, color='steelblue', edgecolor='white')
axes[0].set_xlabel('Number of heads')
axes[0].set_ylabel('Probability')
axes[0].set_title(f'Binomial Distribution B(n={n}, p={p}) (theoretical)')

# Simulation
axes[1].hist(samples, bins=range(n+2), density=True, color='coral', edgecolor='white', alpha=0.7)
axes[1].set_xlabel('Number of heads')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Binomial Distribution B(n={n}, p={p}) (10,000 simulations)')

plt.tight_layout()
plt.show()

Expected output with seed=42:

Expected heads n*p: 10.0
Most likely number of heads: 10
Simulated mean: 9.984

Key parameters:

  • Mean = n × p (if you flip a fair coin 20 times, the expected number of heads is 10)
  • Variance = n × p × (1-p)
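A quick way to sanity-check both formulas (a sketch; stats.binom.stats returns the theoretical mean and variance):

from scipy import stats

# Theoretical mean and variance of B(n=20, p=0.5)
mean, var = stats.binom.stats(20, 0.5)
print(mean, var)   # 10.0 5.0, matching n*p and n*p*(1-p)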

Poisson distribution — counting "rare events"

It counts how many times an event occurs in a fixed interval of time or space, when events happen independently at a constant average rate λ.

# Poisson distribution: a milk tea shop gets an average of 5 customers per hour
lambda_ = 5 # average value (λ)

x = np.arange(0, 20)
pmf = stats.poisson.pmf(x, lambda_)

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(x, pmf, color='mediumseagreen', edgecolor='white')
ax.set_xlabel('Number of customers per hour')
ax.set_ylabel('Probability')
ax.set_title(f'Poisson Distribution Poisson(λ={lambda_})')
ax.set_xticks(x)
plt.show()

print(f"Probability of 0 customers: {stats.poisson.pmf(0, lambda_):.4f}")
print(f"Probability of 5 customers: {stats.poisson.pmf(5, lambda_):.4f}")
print(f"Probability of 10+ customers: {1 - stats.poisson.cdf(9, lambda_):.4f}")

Expected output:

Probability of 0 customers: 0.0067
Probability of 5 customers: 0.1755
Probability of 10+ customers: 0.0318

Application in AI: the number of rare words in a text, website traffic volume, anomaly detection.
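The theoretical PMF above can also be sampled directly. Here is a sketch of simulated hourly arrivals (rng.poisson is NumPy's Poisson sampler; the 24-hour framing is just for illustration):

import numpy as np

rng = np.random.default_rng(seed=42)
arrivals = rng.poisson(5, 24)   # simulated customer counts for 24 hours
print(arrivals.mean())          # ≈ 5, close to λ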


Continuous distributions

Uniform distribution — completely random

Every value in the interval is equally likely: the probability density is flat across the whole range.

# Uniform distribution U(0, 1)
rng = np.random.default_rng(seed=42)
samples = rng.uniform(0, 1, 10000)
print(f"Uniform sample mean: {samples.mean():.3f}")
print(f"Uniform sample min/max: {samples.min():.3f}/{samples.max():.3f}")

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(samples, bins=50, density=True, color='steelblue', edgecolor='white', alpha=0.7)
ax.axhline(y=1, color='red', linestyle='--', label='Theoretical density = 1')
ax.set_xlabel('Value')
ax.set_ylabel('Probability density')
ax.set_title('Uniform Distribution U(0, 1)')
ax.legend()
plt.show()

Expected output with seed=42:

Uniform sample mean: 0.497
Uniform sample min/max: 0.000/1.000

Application in AI: random weight initialization, random sampling, random transformations in data augmentation.
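As one concrete (hypothetical) example of uniform weight initialization, the Glorot/Xavier-uniform convention draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)); the layer sizes below are made up:

import numpy as np

rng = np.random.default_rng(seed=0)
fan_in, fan_out = 256, 128                # example layer sizes
limit = np.sqrt(6 / (fan_in + fan_out))   # Xavier-uniform bound
W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
print(W.shape, round(float(limit), 3))    # (256, 128) 0.125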

Normal distribution (Gaussian distribution) — the most important distribution

Normal distribution is often called a Gaussian distribution. stats.norm.pdf(x, mu, sigma) returns the height of the bell curve at x. For continuous distributions, the height itself is not a probability; the probability is the area under the curve across an interval.
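A small numerical check of that distinction (a sketch; the interval width of 0.2 is arbitrary):

print(stats.norm.pdf(0))                           # ≈ 0.399, the curve height at x = 0 (not a probability)
print(stats.norm.cdf(0.1) - stats.norm.cdf(-0.1))  # ≈ 0.0797, the actual probability P(-0.1 ≤ X ≤ 0.1)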

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Different means
x = np.linspace(-8, 12, 1000)
for mu in [-2, 0, 3, 5]:
    axes[0].plot(x, stats.norm.pdf(x, mu, 1), linewidth=2, label=f'μ={mu}, σ=1')
axes[0].set_title('Different means μ (different center positions)')
axes[0].legend()
axes[0].set_xlabel('x')
axes[0].set_ylabel('Probability density')

# Different standard deviations
for sigma in [0.5, 1, 2, 4]:
    axes[1].plot(x, stats.norm.pdf(x, 0, sigma), linewidth=2, label=f'μ=0, σ={sigma}')
axes[1].set_title('Different standard deviations σ (different widths)')
axes[1].legend()
axes[1].set_xlabel('x')
axes[1].set_ylabel('Probability density')

plt.tight_layout()
plt.show()

The 68-95-99.7 rule

The normal distribution has a very useful rule:

mu, sigma = 0, 1

print("68-95-99.7 rule:")
for k, pct in [(1, '68.3%'), (2, '95.4%'), (3, '99.7%')]:
    area = stats.norm.cdf(mu + k*sigma) - stats.norm.cdf(mu - k*sigma)
    print(f" Within μ ± {k}σ: {area:.1%} of the data (theoretical {pct})")

Expected output:

68-95-99.7 rule:
Within μ ± 1σ: 68.3% of the data (theoretical 68.3%)
Within μ ± 2σ: 95.4% of the data (theoretical 95.4%)
Within μ ± 3σ: 99.7% of the data (theoretical 99.7%)

# Visualize 68-95-99.7
fig, ax = plt.subplots(figsize=(10, 5))
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x)

ax.plot(x, y, 'k-', linewidth=2)

# Fill regions
colors = ['steelblue', 'cornflowerblue', 'lightblue']
labels = ['68.3% (±1σ)', '95.4% (±2σ)', '99.7% (±3σ)']
for k, color, label in zip([3, 2, 1], colors[::-1], labels[::-1]):
    mask = (x >= -k) & (x <= k)
    ax.fill_between(x[mask], y[mask], alpha=0.5, color=color, label=label)

ax.set_xlabel('Standard deviations')
ax.set_ylabel('Probability density')
ax.set_title('The 68-95-99.7 Rule of the Normal Distribution')
ax.legend(loc='upper right')
plt.show()

Applications of the normal distribution in AI

| Use case | Description |
| --- | --- |
| Weight initialization | Neural network weights are often initialized from a normal distribution (such as He and Xavier initialization) |
| Data standardization | Convert data to a "standard normal" scale with mean 0 and standard deviation 1 |
| Noise modeling | Sensor noise and measurement error are often assumed to follow a normal distribution |
| Generative models | VAEs and diffusion models sample from a normal distribution to generate new data |
| Anomaly detection | Data points more than 3σ away from the mean may be outliers |
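As a concrete illustration of the anomaly-detection row (a sketch with made-up sensor data; the 3σ threshold is the conventional choice from the rule above):

import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(50, 5, 1000)   # hypothetical sensor readings centered at 50
data[0] = 90.0                   # inject one obvious outlier

z = (data - data.mean()) / data.std()   # standardize to "how many σ from the mean"
print(data[np.abs(z) > 3])              # flags the injected point (plus any natural extremes)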

The Central Limit Theorem — the most important theorem

Core idea

No matter what the original data distribution is, the average of a large number of independent, identically distributed samples (with finite variance) tends toward a normal distribution.

This is why the normal distribution appears everywhere in nature and data science — many phenomena are essentially the combined effect of many independent factors.

Verify it with code

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Three completely different original distributions
rng = np.random.default_rng(seed=42)
distributions = [
    ('Uniform distribution', lambda n: rng.uniform(0, 1, n)),
    ('Exponential distribution', lambda n: rng.exponential(1, n)),
    ('Binomial distribution', lambda n: rng.binomial(10, 0.3, n)),
]

for col, (name, dist_func) in enumerate(distributions):
    # Top: original distribution
    samples = dist_func(10000)
    axes[0, col].hist(samples, bins=50, density=True, color='coral',
                      edgecolor='white', alpha=0.7)
    axes[0, col].set_title(f'Original distribution: {name}')
    axes[0, col].set_ylabel('Probability density')

    # Bottom: take the average of 30 samples, repeat 10000 times
    n_samples = 30
    means = np.array([dist_func(n_samples).mean() for _ in range(10000)])

    axes[1, col].hist(means, bins=50, density=True, color='steelblue',
                      edgecolor='white', alpha=0.7)

    # Overlay a normal distribution curve
    x = np.linspace(means.min(), means.max(), 100)
    axes[1, col].plot(x, stats.norm.pdf(x, means.mean(), means.std()),
                      'r-', linewidth=2, label='Normal fit')
    axes[1, col].set_title(f'Distribution of sample means (n={n_samples})')
    axes[1, col].set_ylabel('Probability density')
    axes[1, col].legend()
    print(f"{name}: mean of sample means={means.mean():.3f}, std={means.std():.3f}")

plt.suptitle('Central Limit Theorem: No matter what the original distribution is, sample means tend toward a normal distribution',
             fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

Expected output with seed=42:

Uniform distribution: mean of sample means=0.500, std=0.053
Exponential distribution: mean of sample means=0.999, std=0.182
Binomial distribution: mean of sample means=3.005, std=0.262

Interpretation: No matter whether the original data is uniform, skewed, or discrete, once you average enough samples, the distribution of those averages looks approximately normal.

The effect of sample size

fig, axes = plt.subplots(1, 4, figsize=(18, 4))

# Use the exponential distribution (highly skewed) for the experiment
rng = np.random.default_rng(seed=42)
for ax, n in zip(axes, [1, 5, 30, 100]):
    means = [rng.exponential(1, n).mean() for _ in range(10000)]
    ax.hist(means, bins=50, density=True, color='steelblue', edgecolor='white', alpha=0.7)

    x = np.linspace(min(means), max(means), 100)
    ax.plot(x, stats.norm.pdf(x, np.mean(means), np.std(means)), 'r-', linewidth=2)
    ax.set_title(f'n = {n}')
    ax.set_xlabel('Sample mean')

plt.suptitle('The larger the sample size, the closer the mean distribution is to normal', fontsize=13)
plt.tight_layout()
plt.show()

Rule of thumb

Usually, once n ≥ 30, the normal approximation given by the Central Limit Theorem is already quite good. That is why many statistical methods ask for a "sample size of at least 30."


Distribution overview table

| Distribution | Type | Parameters | Typical scenario | NumPy generation |
| --- | --- | --- | --- | --- |
| Bernoulli | Discrete | p (success probability) | Binary classification labels | rng.binomial(1, p) |
| Binomial | Discrete | n, p | Number of successes in n trials | rng.binomial(n, p) |
| Poisson | Discrete | λ (average rate) | Rare event counting | rng.poisson(lam) |
| Uniform | Continuous | a, b (range) | Random initialization | rng.uniform(a, b) |
| Normal | Continuous | μ, σ (mean, standard deviation) | Noise, weight initialization | rng.normal(mu, sigma) |
| Exponential | Continuous | λ (rate) | Time between events | rng.exponential(1/lam) |

After learning this, what questions should you take to the next section?

After looking at distributions, the most valuable questions to carry forward are:

  1. If I already know what a certain distribution looks like, how do I infer its parameters from observed data?
  2. What does "the model that best explains the data" actually mean?
  3. When I see a difference in an A/B test, how can I tell whether it is a real difference or just random fluctuation?

These questions will naturally lead you to:

Connecting ahead
  • Next section: Statistical inference — inferring distribution parameters from data
  • 5 Introduction to Machine Learning and Practice: Logistic regression uses the sigmoid function to output the Bernoulli distribution parameter p
  • 6 Fundamentals of Deep Learning and Transformers: Neural network weights are initialized with a normal distribution (He/Xavier initialization)
  • 7 Principles of Large Models, Prompting, and Fine-Tuning: VAE models assume latent variables follow a normal distribution

Summary

| Concept | Intuition |
| --- | --- |
| Probability distribution | The "map of possibilities" for a random variable |
| Discrete distribution | Takes countable values (finite or countably infinite), each with a definite probability |
| Continuous distribution | Takes any value in a range, described by a probability density function |
| PMF | Probability assigned to each discrete value |
| PDF | Density curve for continuous values; probabilities are areas under the curve |
| CDF | Accumulated probability up to a value |
| Normal distribution | The most important distribution: a bell curve determined by μ and σ |
| Central Limit Theorem | Sample means tend toward a normal distribution, regardless of the original distribution |

What you should take away from this section

  • The most important intuition about probability distributions is "what does the whole random phenomenon look like"
  • Bernoulli, binomial, and Poisson are for discrete counting problems
  • The normal distribution and the Central Limit Theorem will appear repeatedly later in AI

Hands-on Practice

Exercise 1: Plot all distributions

In a 2×3 subplot grid, plot Bernoulli, binomial, Poisson, uniform, normal, and exponential distributions.

Reference implementation:

rng = np.random.default_rng(seed=42)
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

axes[0].bar([0, 1], [0.4, 0.6], color=["coral", "steelblue"])
axes[0].set_title("Bernoulli(p=0.6)")

x = np.arange(0, 21)
axes[1].bar(x, stats.binom.pmf(x, 20, 0.5), color="steelblue")
axes[1].set_title("Binomial(n=20, p=0.5)")

x = np.arange(0, 16)
axes[2].bar(x, stats.poisson.pmf(x, 5), color="mediumseagreen")
axes[2].set_title("Poisson(lambda=5)")

samples = rng.uniform(0, 1, 10000)
axes[3].hist(samples, bins=40, density=True, color="steelblue", alpha=0.7)
axes[3].set_title("Uniform(0, 1)")

x = np.linspace(-4, 4, 300)
axes[4].plot(x, stats.norm.pdf(x), color="black")
axes[4].set_title("Normal(0, 1)")

samples = rng.exponential(1, 10000)
axes[5].hist(samples, bins=40, density=True, color="orange", alpha=0.7)
axes[5].set_title("Exponential(scale=1)")

plt.tight_layout()
plt.show()

Exercise 2: Verify 68-95-99.7

Generate 100000 height data points from N(170, 5) (mean 170 cm, standard deviation 5 cm), and verify what proportion of people have heights between 160 and 180 cm (±2σ).

Reference implementation:

rng = np.random.default_rng(seed=42)
heights = rng.normal(170, 5, 100000)
within = ((heights >= 160) & (heights <= 180)).mean()
print(f"Height within 160-180 cm: {within:.1%}")

Expected output:

Height within 160-180 cm: 95.4%

Exercise 3: Central Limit Theorem experiment

Use dice (uniform distribution from 1 to 6) to perform a Central Limit Theorem experiment: roll the dice 1 time, 10 times, 50 times, and 200 times, compute the average each time, repeat each group 10000 times, and plot the distribution of the averages.

Reference implementation:

rng = np.random.default_rng(seed=42)
for n_rolls in [1, 10, 50, 200]:
    means = rng.integers(1, 7, size=(10000, n_rolls)).mean(axis=1)
    print(f"Dice n={n_rolls}: mean={means.mean():.3f}, std={means.std():.3f}")
    # To complete the exercise, also plot a histogram of `means` for each n

Expected output:

Dice n=1: mean=3.475, std=1.704
Dice n=10: mean=3.503, std=0.541
Dice n=50: mean=3.499, std=0.241
Dice n=200: mean=3.500, std=0.120

The mean stays close to 3.5, while the standard deviation of the averages shrinks as n grows. That is the Central Limit Theorem becoming visible in code.
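You can check that shrinkage against theory: the standard deviation of the average of n rolls should be σ/√n, where σ = √(35/12) ≈ 1.708 is the standard deviation of a single fair die. A quick sketch:

import numpy as np

sigma = np.sqrt(35 / 12)   # std of a single fair die roll
for n in [1, 10, 50, 200]:
    print(f"n={n}: theoretical std of the average = {sigma / np.sqrt(n):.3f}")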