Skip to content

4.2.5 Fundamentals of Information Theory

Information entropy and uncertainty diagram

  • Understand information content and entropy — measures of uncertainty
  • Understand cross-entropy — the difference between two distributions
  • Understand KL divergence — the “distance” from one distribution to another
  • Compute and visualize everything with Python

Information theory terms can sound abstract, but most of them answer very concrete questions:

TermMeaningBeginner-friendly question
bitUnit of information when using log2How many yes/no questions would this information be worth?
natUnit of information when using natural log lnThe unit most deep learning loss functions use internally
log2Logarithm base 2Used when we want answers in bits
entropyAverage uncertainty of a distributionBefore seeing the answer, how uncertain are we on average?
cross-entropyCost of predicting distribution Q when truth is PHow expensive is it to use the model’s prediction to explain the truth?
KL divergenceExtra cost from using Q instead of PHow much extra cost do we pay because the model distribution is wrong?
logitsRaw model scores before probability conversionThe unnormalized scores a neural network outputs before softmax
softmaxConverts logits into probabilitiesTurn raw class scores into a probability distribution that sums to 1
perplexity2 ** cross_entropy when cross-entropy uses bitsA language-model metric; lower usually means the model is less confused
RLHF / PPOReinforcement Learning from Human Feedback / Proximal Policy OptimizationTraining methods that use KL to keep a fine-tuned model from drifting too far

This lesson uses log2 for information-theory intuition in bits. Many deep learning libraries use the natural logarithm, so their loss values are in nats. The shape and optimization meaning are the same; only the unit changes.

Historical Background: What Is the Most Important Starting Point for This Section?

Section titled “Historical Background: What Is the Most Important Starting Point for This Section?”

The most important historical milestone in this section is:

YearPaperKey AuthorWhat it Solved Most Importantly
1948A Mathematical Theory of CommunicationClaude ShannonSystematically introduced information content, entropy, and the main framework of modern information theory

For beginners, the most important thing to remember first is:

Shannon made it possible to rigorously measure “how much information” there is for the first time.

So the topics in this section:

  • information content
  • entropy
  • cross-entropy

are not separate ideas, but all part of the same information theory framework.

First, Set the Right Learning Expectations

Section titled “First, Set the Right Learning Expectations”

The most common sticking points for beginners in this section are:

  • the names sound very abstract
  • the formulas look like “math about math”

But what matters most here is not memorizing every definition first. Instead, start by understanding:

  • why “the more surprising something is, the more information it contains”
  • why “the more uncertain something is, the higher its entropy”
  • why classification loss is connected to these quantities

You can think of this section as:

A more precise language for describing how certain a model is and how far its prediction is from the truth.


This section may not look much like a “probability lesson,” but it is very closely related to model training.

Map from information theory to loss function

So what this lesson really wants to explain is:

  • why “more surprising events carry more information”
  • why a distribution with more uncertainty has higher entropy
  • why cross-entropy becomes the core loss function for classification tasks

Information Content — “How Surprised Are You?”

Section titled “Information Content — “How Surprised Are You?””

The information content of a message = how unexpected it is.

  • “The sun rises in the east” → information content ≈ 0 (not surprising at all)
  • “It snowed in Beijing today” (in summer) → very high information content (very surprising)
  • “It snowed in Beijing today” (in winter) → moderate information content

The lower the probability of an event, the more information it brings when it happens.

You can think of information content as:

  • “How much should I be surprised by this?”

For example:

  • The sun rises tomorrow: not worth being surprised about
  • It snows in summer: definitely surprising

So the most important thing to remember about information content is not the formula, but this:

The less common something is, the more information it gives when it happens.

Information content = -log2(probability)

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
# Information content for different probabilities
probs = np.linspace(0.01, 1, 100)
info = -np.log2(probs)
plt.figure(figsize=(8, 5))
plt.plot(probs, info, color='steelblue', linewidth=2)
plt.xlabel('Event Probability')
plt.ylabel('Information Content (bits)')
plt.title('Information Content = -log₂(Probability)')
plt.grid(True, alpha=0.3)
# Mark a few key points
for p, label in [(1.0, 'Certain event'), (0.5, 'Coin flip'), (0.01, 'Rare event')]:
i = -np.log2(p)
plt.annotate(f'{label}\np={p}, info={i:.1f}bit',
xy=(p, i), fontsize=10,
xytext=(p+0.15, i+0.5),
arrowprops=dict(arrowstyle='->', color='gray'))
plt.show()
Event ProbabilityInformation ContentIntuition
1.00 bitIt must happen, so it carries no information
0.51 bitOne coin flip gives 1 bit
0.252 bitsGuessing a two-digit binary number
0.016.64 bitsVery surprising, so it carries a lot of information

If you run the formula directly:

for p in [1.0, 0.5, 0.25, 0.01]:
print(f"p={p:>4}: information={-np.log2(p):.4f} bits")

Expected output:

Terminal window
p= 1.0: information=-0.0000 bits
p= 0.5: information=1.0000 bits
p=0.25: information=2.0000 bits
p=0.01: information=6.6439 bits

Entropy = the “average information content” of a distribution = the system’s “average uncertainty.”

The Most Important Thing to Remember About Entropy Is Not the Definition, but the “Degree of Chaos”

Section titled “The Most Important Thing to Remember About Entropy Is Not the Definition, but the “Degree of Chaos””

You can first think of entropy as:

  • how unsure you are before making a judgment

If a system is usually almost certain:

  • entropy is low

If a system is hard to predict every time:

  • entropy is high

So the most basic meaning of entropy is:

How large is the average uncertainty?

flowchart LR
A["High certainty<br/>Low entropy"] --> E["Example: 99% sunny<br/>Almost no need to guess"]
B["High uncertainty<br/>High entropy"] --> F["Example: 50% sunny 50% rainy<br/>Impossible to guess"]
style A fill:#e8f5e9,stroke:#2e7d32,color:#333
style B fill:#ffebee,stroke:#c62828,color:#333

H(X) = -Σ p(x) × log2(p(x))

def entropy(probs):
"""Compute entropy (in bits)"""
probs = np.array(probs)
# Avoid log(0)
probs = probs[probs > 0]
return -np.sum(probs * np.log2(probs))
# Example 1: fair coin (maximum uncertainty)
h1 = entropy([0.5, 0.5])
print(f"Entropy of a fair coin: {h1:.3f} bit") # 1.0
# Example 2: unfair coin
h2 = entropy([0.9, 0.1])
print(f"Entropy of an unfair coin (0.9, 0.1): {h2:.3f} bit") # 0.469
# Example 3: certain event (no uncertainty)
h3 = entropy([1.0, 0.0])
print(f"Entropy of a certain event: {h3:.3f} bit") # 0.0
# Example 4: fair die
h4 = entropy([1/6]*6)
print(f"Entropy of a fair die: {h4:.3f} bit") # 2.585

Expected output:

Terminal window
Entropy of a fair coin: 1.000 bit
Entropy of an unfair coin (0.9, 0.1): 0.469 bit
Entropy of a certain event: -0.000 bit
Entropy of a fair die: 2.585 bit

-0.000 appears because of floating-point rounding. Conceptually, the entropy is exactly 0.

Visualization: How Coin Entropy Changes with p

Section titled “Visualization: How Coin Entropy Changes with p”
p_values = np.linspace(0.001, 0.999, 1000)
entropies = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_values]
plt.figure(figsize=(8, 5))
plt.plot(p_values, entropies, color='steelblue', linewidth=2)
plt.xlabel('Probability of Heads p')
plt.ylabel('Entropy H (bit)')
plt.title('Entropy of a Binary Distribution: Maximum at p=0.5 (Most Uncertain)')
plt.axvline(x=0.5, color='red', linestyle='--', alpha=0.5, label='p=0.5 (maximum entropy)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Key insight: entropy is maximum when p = 0.5 (most uncertain), and entropy is 0 when p = 0 or 1 (completely certain).

ApplicationDescription
Decision treesUse information gain (the reduction in entropy) to choose the best split feature
Model outputsThe lower the entropy of a classification model’s probability distribution, the more “confident” the model is
Data compressionEntropy is the theoretical lower bound of data compression
Language modelsPerplexity = 2^(cross-entropy), used to measure model quality

Cross-Entropy — Measuring “How Accurate the Prediction Is”

Section titled “Cross-Entropy — Measuring “How Accurate the Prediction Is””

Cross-entropy = the average number of bits needed to encode data from distribution P using distribution Q.

You can also temporarily think of cross-entropy as:

  • how far apart the predicted distribution and the true distribution are

That is already enough for beginners. Because later, in classification models, what you really care about is also:

  • whether the model assigns higher probability to the correct class

If Q and P are exactly the same → cross-entropy = entropy of P (the minimum value) If Q and P are very different → cross-entropy is much larger than the entropy of P

H(P, Q) = -Σ p(x) × log2(q(x))

def cross_entropy(p, q):
"""Compute cross-entropy"""
p, q = np.array(p), np.array(q)
# Avoid log(0)
q = np.clip(q, 1e-10, 1)
return -np.sum(p * np.log2(q))
# True distribution P
P = [0.7, 0.2, 0.1] # three-class problem
# Predicted distributions Q (different prediction quality)
Q_good = [0.65, 0.25, 0.10] # good prediction
Q_bad = [0.33, 0.33, 0.34] # poor prediction (uniform guess)
Q_wrong = [0.1, 0.1, 0.8] # wrong prediction
print(f"Entropy of P: {entropy(P):.4f}")
print(f"Cross-entropy, good prediction: {cross_entropy(P, Q_good):.4f}")
print(f"Cross-entropy, poor prediction: {cross_entropy(P, Q_bad):.4f}")
print(f"Cross-entropy, wrong prediction: {cross_entropy(P, Q_wrong):.4f}")

Expected output:

Terminal window
Entropy of P: 1.1568
Cross-entropy, good prediction: 1.1672
Cross-entropy, poor prediction: 1.5952
Cross-entropy, wrong prediction: 3.0219

In classification tasks, minimizing cross-entropy = making the model’s predictions as close as possible to the true distribution.

# Binary cross-entropy loss
def binary_cross_entropy(y_true, y_pred):
"""Binary cross-entropy (equivalent to PyTorch's BCELoss)"""
y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
return -np.mean(
y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
)
# True labels
y_true = np.array([1, 0, 1, 1, 0])
# Predictions of different quality
predictions = {
"Perfect prediction": np.array([1.0, 0.0, 1.0, 1.0, 0.0]),
"Good prediction": np.array([0.9, 0.1, 0.8, 0.9, 0.2]),
"Poor prediction": np.array([0.6, 0.4, 0.6, 0.6, 0.4]),
"Completely wrong": np.array([0.1, 0.9, 0.1, 0.1, 0.9]),
}
for name, y_pred in predictions.items():
loss = binary_cross_entropy(y_true, y_pred)
print(f"{name:18s} → cross-entropy loss = {loss:.4f}")

Output:

Perfect prediction → cross-entropy loss ≈ 0.0000
Good prediction → cross-entropy loss ≈ 0.1525
Poor prediction → cross-entropy loss ≈ 0.5108
Completely wrong → cross-entropy loss ≈ 2.3026

This block uses np.log, the natural logarithm, so the unit is nats. That matches what most deep learning frameworks use for cross-entropy loss.

Visualization: Prediction Accuracy vs. Loss

Section titled “Visualization: Prediction Accuracy vs. Loss”
# Relationship between predicted probability p and loss when the true label y=1
p_pred = np.linspace(0.01, 0.99, 100)
# When y=1, loss = -log(p)
loss_y1 = -np.log(p_pred)
# When y=0, loss = -log(1-p)
loss_y0 = -np.log(1 - p_pred)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(p_pred, loss_y1, color='steelblue', linewidth=2)
axes[0].set_xlabel('Model prediction P(y=1)')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss when the true label is y=1\nThe closer the prediction is to 1, the smaller the loss')
axes[0].grid(True, alpha=0.3)
axes[1].plot(p_pred, loss_y0, color='coral', linewidth=2)
axes[1].set_xlabel('Model prediction P(y=1)')
axes[1].set_ylabel('Loss')
axes[1].set_title('Loss when the true label is y=0\nThe closer the prediction is to 0, the smaller the loss')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Interpretation: cross-entropy loss has a very nice property — when the prediction is wrong (for example, y=1 but the model outputs 0.01), the loss increases sharply, strongly penalizing wrong predictions.

A More Beginner-Friendly Classification Intuition

Section titled “A More Beginner-Friendly Classification Intuition”

You can first think of cross-entropy as:

  • how much confidence you placed on the correct answer

If the correct class is clearly cat, but you assign most of the probability to dog, then cross-entropy will be large. If you assign most of the probability to the correct class, cross-entropy will be smaller.

So the most important thing to remember about cross-entropy is not what the formula looks like, but this:

The more probability the model assigns to the correct answer, the smaller the loss usually is.

Another Small “Single-Sample Cross-Entropy” Example

Section titled “Another Small “Single-Sample Cross-Entropy” Example”
labels = ["cat", "dog", "bird"]
true_label = "dog"
pred_probs = [0.1, 0.7, 0.2]
true_index = labels.index(true_label)
loss = -np.log(pred_probs[true_index])
print("Correct class:", true_label)
print("Model prediction:", dict(zip(labels, pred_probs)))
print("Cross-entropy loss:", round(loss, 4))

Expected output:

Terminal window
Correct class: dog
Model prediction: {'cat': 0.1, 'dog': 0.7, 'bird': 0.2}
Cross-entropy loss: 0.3567

This example is especially good for beginners because it brings abstract distribution language back to the most familiar classification setting:

  • Who is the correct answer?
  • How much probability did you assign to it?
  • How large does that make the loss?

KL Divergence — The “Distance” Between Two Distributions

Section titled “KL Divergence — The “Distance” Between Two Distributions”

KL divergence = how many extra bits you “pay” if you use distribution Q instead of the true distribution P.

Why Does KL Divergence Feel So Intimidating?

Section titled “Why Does KL Divergence Feel So Intimidating?”

Because it looks like a “distance” between distributions, but it is not symmetric.

A safer first step for beginners is to understand it as:

  • KL is not a normal geometric distance
  • It is more like: “If I use the wrong distribution to approximate the true one, how much extra cost will I pay?”

This intuition is already enough to help you understand its role in:

  • VAE
  • distillation
  • RLHF

KL(P || Q) = cross-entropy(P, Q) - entropy(P)

def kl_divergence(p, q):
"""Compute KL divergence"""
p, q = np.array(p), np.array(q)
q = np.clip(q, 1e-10, 1)
p = np.clip(p, 1e-10, 1)
return np.sum(p * np.log2(p / q))
P = [0.7, 0.2, 0.1]
Q1 = [0.65, 0.25, 0.10] # close to P
Q2 = [0.33, 0.33, 0.34] # far from P
print(f"KL(P || Q1): {kl_divergence(P, Q1):.4f} (Q1 is close to P)")
print(f"KL(P || Q2): {kl_divergence(P, Q2):.4f} (Q2 is far from P)")
print(f"KL(P || P): {kl_divergence(P, P):.4f} (P with itself)")

Expected output:

Terminal window
KL(P || Q1): 0.0105 (Q1 is close to P)
KL(P || Q2): 0.4384 (Q2 is far from P)
KL(P || P): 0.0000 (P with itself)
PropertyDescription
Non-negativityKL(P || Q) ≥ 0, with equality if and only if P = Q
AsymmetryKL(P || Q) ≠ KL(Q || P), so it is not a true “distance”
Zero when P=QWhen the two distributions are exactly the same, KL divergence is 0
# Verify asymmetry
P = [0.7, 0.2, 0.1]
Q = [0.33, 0.33, 0.34]
print(f"KL(P || Q) = {kl_divergence(P, Q):.4f}")
print(f"KL(Q || P) = {kl_divergence(Q, P):.4f}")
print("They are not equal!")

Expected output:

Terminal window
KL(P || Q) = 0.4384
KL(Q || P) = 0.4807
They are not equal!

Visualization: How KL Divergence Changes with Distribution Difference

Section titled “Visualization: How KL Divergence Changes with Distribution Difference”
# Binary distribution: P = [0.8, 0.2], let Q vary from [0.01, 0.99] to [0.99, 0.01]
p = 0.8
q_values = np.linspace(0.01, 0.99, 200)
kl_values = [kl_divergence([p, 1-p], [q, 1-q]) for q in q_values]
plt.figure(figsize=(8, 5))
plt.plot(q_values, kl_values, color='steelblue', linewidth=2)
plt.axvline(x=p, color='red', linestyle='--', label=f'q = p = {p} (KL=0)')
plt.xlabel('Value of q')
plt.ylabel('KL(P || Q)')
plt.title(f'KL Divergence: P=[{p}, {1-p}], Q=[q, 1-q]')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
ApplicationDescription
VAEMakes the latent variable distribution close to a standard normal distribution: KL(q(z|x) || N(0,1))
Knowledge distillationMakes the small model’s output distribution close to the large model’s output distribution: minimize KL divergence
RLHFConstrains the fine-tuned model so it does not drift too far from the original model
Policy optimizationRestricts the size of policy updates in PPO

flowchart TD
H["Entropy H(P)<br/>Uncertainty of P itself<br/>Lower bound of information"]
CE["Cross-entropy H(P,Q)<br/>Using Q to encode data from P<br/>= Entropy + KL divergence"]
KL["KL divergence KL(P||Q)<br/>Extra bits paid when Q differs from P<br/>= Cross-entropy - Entropy"]
CE --> H
CE --> KL
H -.->|"H(P,Q) = H(P) + KL(P||Q)"| CE
style H fill:#e3f2fd,stroke:#1565c0,color:#333
style CE fill:#fff3e0,stroke:#e65100,color:#333
style KL fill:#e8f5e9,stroke:#2e7d32,color:#333

Core relationship: cross-entropy = entropy + KL divergence

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]
h = entropy(P)
ce = cross_entropy(P, Q)
kl = kl_divergence(P, Q)
print(f"Entropy H(P): {h:.4f}")
print(f"Cross-entropy H(P,Q): {ce:.4f}")
print(f"KL divergence: {kl:.4f}")
print(f"H(P) + KL = {h + kl:.4f}") # = cross-entropy ✓

Expected output:

Terminal window
Entropy H(P): 1.1568
Cross-entropy H(P,Q): 1.2796
KL divergence: 0.1228
H(P) + KL = 1.2796

A Comparison Table That’s Very Good for Beginners to Remember

Section titled “A Comparison Table That’s Very Good for Beginners to Remember”
ConceptKey question
Information contentHow surprising is this event?
EntropyHow uncertain is the system on average?
Cross-entropyHow far apart are the prediction and the truth?
KL divergenceIf I use the wrong distribution, how much extra cost will I pay?

This table is especially useful for beginners because it turns this chapter back into an intuitive map instead of a pile of formulas.


After Learning This, Where Should You Go Next?

Section titled “After Learning This, Where Should You Go Next?”

If you have read this entire probability and statistics chapter, what is most worth taking with you is not more formulas, but these questions:

  1. How do probability, cross-entropy, and KL divergence actually become loss functions?
  2. Why do models output probabilities instead of absolute conclusions?
  3. How do these “uncertainty languages” turn into training and evaluation tools in machine learning?

The most natural next reading is usually:


Keep this page’s proof of learning as a small evidence card:

Random Process
event, distribution, sample, likelihood, entropy, or Bayes update
Simulation Or Formula
code or formula used to make uncertainty visible
Output
probability, sample statistic, interval, entropy, or updated belief
Failure Check
base-rate confusion, p-value misuse, sample bias, or mixing probability with certainty
Expected Output
numeric result plus interpretation in plain language
ConceptIntuitionRange
Information contentHow “surprising” an event is≥ 0
EntropyHow “uncertain” a distribution is≥ 0
Cross-entropyHow different the predicted distribution is from the true distribution≥ H(P)
KL divergenceThe “distance” between two distributions≥ 0

What You Should Take Away Most from This Section

Section titled “What You Should Take Away Most from This Section”
  • The most important intuition for information content is: “the more surprising, the more information”
  • The most important intuition for entropy is: “average uncertainty”
  • The most important intuition for cross-entropy is: “how far apart the prediction and the truth are”
  • The most important intuition for KL divergence is: “how much extra cost you pay for using the wrong distribution”

Compute the entropy of the following distributions and explain which one is the most “uncertain”:

  1. [0.25, 0.25, 0.25, 0.25] (a four-sided die)
  2. [0.97, 0.01, 0.01, 0.01] (almost certain)
  3. [0.4, 0.3, 0.2, 0.1] (non-uniform)

Reference implementation:

distributions = [
[0.25, 0.25, 0.25, 0.25],
[0.97, 0.01, 0.01, 0.01],
[0.4, 0.3, 0.2, 0.1],
]
for probs in distributions:
print(f"{probs} -> entropy={entropy(probs):.4f} bits")

Expected output:

Terminal window
[0.25, 0.25, 0.25, 0.25] -> entropy=2.0000 bits
[0.97, 0.01, 0.01, 0.01] -> entropy=0.2419 bits
[0.4, 0.3, 0.2, 0.1] -> entropy=1.8464 bits

The first distribution is the most uncertain because all four outcomes are equally likely.

For a three-class problem: the true label is class 2 (one-hot: [0, 1, 0]), and the model outputs softmax probabilities [0.1, 0.7, 0.2]. Compute the cross-entropy loss. If the model output changes to [0.05, 0.9, 0.05], how will the loss change?

Reference implementation:

P = np.array([0, 1, 0])
Q1 = np.array([0.1, 0.7, 0.2])
Q2 = np.array([0.05, 0.9, 0.05])
loss1 = -np.sum(P * np.log(Q1))
loss2 = -np.sum(P * np.log(Q2))
print(f"Loss with [0.1, 0.7, 0.2]: {loss1:.4f} nats")
print(f"Loss with [0.05, 0.9, 0.05]: {loss2:.4f} nats")

Expected output:

Terminal window
Loss with [0.1, 0.7, 0.2]: 0.3567 nats
Loss with [0.05, 0.9, 0.05]: 0.1054 nats

The second prediction gives more probability to the correct class, so the loss becomes smaller.

Draw a graph showing how KL(P||Q) changes when the true distribution P = [0.6, 0.3, 0.1] is fixed and Q changes along some parameter (for example, q1 from 0.1 to 0.9).

Reference implementation:

P = np.array([0.6, 0.3, 0.1])
q1_values = np.linspace(0.1, 0.9, 200)
kl_values = []
for q1 in q1_values:
# Keep the remaining probability split in a 3:1 ratio.
Q = np.array([q1, (1 - q1) * 0.75, (1 - q1) * 0.25])
kl_values.append(kl_divergence(P, Q))
plt.figure(figsize=(8, 5))
plt.plot(q1_values, kl_values, color="steelblue", linewidth=2)
plt.axvline(x=0.6, color="red", linestyle="--", label="Q = P")
plt.xlabel("q1")
plt.ylabel("KL(P || Q)")
plt.title("KL divergence is smallest when Q matches P")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
for q1 in [0.1, 0.6, 0.9]:
Q = np.array([q1, (1 - q1) * 0.75, (1 - q1) * 0.25])
print(f"q1={q1:.1f}, Q={Q.round(3)}, KL={kl_divergence(P, Q):.4f}")

Expected output:

Terminal window
q1=0.1, Q=[0.1 0.675 0.225], KL=1.0830
q1=0.6, Q=[0.6 0.3 0.1], KL=-0.0000
q1=0.9, Q=[0.9 0.075 0.025], KL=0.4490
Operation guide and checkpoints
  • The entropies should be 2.0000, about 0.2419, and about 1.8464 bits. The uniform four-outcome distribution is most uncertain because no outcome is preferred.
  • For the three-class cross-entropy example, the losses are about 0.3567 and 0.1054 nats. Giving more probability to the true class lowers the loss.
  • A good explanation links this to training: cross-entropy rewards calibrated confidence on the correct class and penalizes confident probability placed on wrong classes.