Self-information, Entropy, Cross Entropy, and Kullback-Leibler Distance

When we develop a model for probabilistic classification, we aim to map the model's inputs to probabilistic predictions, and we often train our model by incrementally adjusting the model's parameters so that our predictions get closer and closer to ground-truth probabilities.

Content

Self-information

Definition

Self-information measures the amount of information carried by an event or outcome. It reflects how unexpected that event is:

I(E) = -log2(Pr(E)) = log2(1 / Pr(E))

where:

- I(E) is the self-information of the event or outcome 'E', measured in bits (because the logarithm is base 2).
- Pr(E) is the probability of the event or outcome 'E' occurring.

If Pr(E) = 1 then I(E) = 0; if Pr(E) < 1 then I(E) > 0. A certain event carries no information, and any event that might not happen carries some.

Events with low probability (i.e., rare or unexpected events) have high self-information, while events with high probability (i.e., common or expected events) have low self-information.
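The relationship between probability and self-information can be checked with a few lines of Python. This is a minimal sketch, assuming base-2 logarithms (bits); the function name self_information is chosen here for illustration and does not come from any library.

```python
import math

def self_information(p: float) -> float:
    """Self-information log2(1 / p) of an event with probability p, in bits.

    Hypothetical helper for illustration; p must lie in (0, 1].
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return math.log2(1.0 / p)

print(self_information(1.0))    # certain event   -> 0.0
print(self_information(0.5))    # fair coin flip  -> 1.0
print(self_information(0.125))  # rarer event     -> 3.0
```

Halving the probability adds exactly one bit, which is why rare events score high.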

Example 1

Consider flipping a fair coin. There are two possible outcomes: heads (H) and tails (T).
Here, the probabilities of each outcome are:
1. p(H) = 0.5 (probability of getting heads).
2. p(T) = 0.5 (probability of getting tails).
The self-information for each outcome:
1. For H (heads): I(H) = -log2(0.5) = -(-1) = 1 bit
2. For T (tails): I(T) = -log2(0.5) = -(-1) = 1 bit
In this case, both outcomes have a self-information of 1 bit, indicating that getting either heads or tails in a fair coin flip provides 1 bit of information and is equally surprising.

Example 2

Consider the outcome of winning a lottery, with a tiny probability of around 1 in 100 million.
Its self-information is far higher than that of a coin flip:

I(E) = log2(1/(1/100 million)) = log2(10^8) ≈ 26.6 bits

Winning the lottery carries roughly 26.6 bits of information, reflecting how surprising such a low-probability event is.

Entropy

Definition

Entropy is a measure of the uncertainty of a system. Intuitively, it is the average amount of information needed to remove the uncertainty about the system's state.
The entropy of a probability distribution p over the states x of a system can be computed as follows:

H(p) = -Σ_x p(x) * log2(p(x))

With the base-2 logarithm the result is in bits, and it equals the expected self-information of an outcome drawn from p.
In other words, entropy quantifies the "unpredictability" of a random variable: it is high for a random variable that is hard to predict and low for one that is easy to predict.

Example 1

Compute the entropy of a fair coin. Given:
1. P(X=heads) = 0.5
2. P(X=tails) = 0.5

H(X) = -[0.5 * log2(0.5) + 0.5 * log2(0.5)] = 1 bit

Example 2

Compute the entropy of an unfair coin. Given:
1. P(X=heads) = 0.75
2. P(X=tails) = 0.25

H(X) = -[0.75 * log2(0.75) + 0.25 * log2(0.25)] ≈ 0.811 bits

=> The entropy is lower than in Example 1. The coin lands heads most of the time, so its outcome is more predictable.

Example 3

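Compute the entropy of an unfair 6-sided die. Given:
1. P(X=1) = 0.5
2. P(X=2) = 0.25
3. P(X=3) = 0
4. P(X=4) = 0
5. P(X=5) = 0.125
6. P(X=6) = 0.125

H(X) = -[0.5 * log2(0.5) + 0.25 * log2(0.25) + 2 * 0.125 * log2(0.125)] = 1.75 bits

Outcomes with probability 0 contribute nothing, by the convention 0 * log2(0) = 0.

=> A random variable with more possible values can be more unpredictable: this die has a higher entropy than either coin, although it is still below the maximum of log2(6) ≈ 2.585 bits that a fair six-sided die would reach.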
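The three entropy examples above can be reproduced numerically. The sketch below assumes base-2 logarithms and treats zero-probability outcomes as contributing nothing; the function name entropy is illustrative, not a library call.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list.

    Hypothetical helper: outcomes with probability 0 are skipped,
    following the convention 0 * log2(0) = 0.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
unfair_coin = [0.75, 0.25]
unfair_die = [0.5, 0.25, 0.0, 0.0, 0.125, 0.125]

print(round(entropy(fair_coin), 3))    # 1.0
print(round(entropy(unfair_coin), 3))  # 0.811
print(round(entropy(unfair_die), 3))   # 1.75
```

The fair coin reaches the maximum possible entropy for two outcomes, while the biased coin falls below it.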

Cross-Entropy

Definition

Cross-entropy measures the average amount of information needed to describe outcomes drawn from one probability distribution p when they are encoded according to another distribution q.
The cross-entropy of distributions p and q can be formulated as follows:

H(p, q) = -Σ_X p(X) * log(q(X))

where:
1. p(X) is the probability of class X in TARGET (Ground-truth distribution).
2. q(X) is the probability of class X in PREDICTION (Predicted distribution).

Note: In general, H(p,q) ≠ H(q,p).

In other words, Cross-entropy is a measure of how well one probability distribution 'p' (the true distribution or ground truth) is approximated by another probability distribution 'q' (the predicted or estimated distribution).
It quantifies the dissimilarity between the two distributions by evaluating how well 'q' represents 'p'.
1. A lower cross-entropy value indicates a better match between the two distributions.
2. A higher value suggests a greater dissimilarity.
In classification tasks, minimizing cross-entropy is a common objective, as it encourages the model to assign higher probabilities to the true class labels.
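To make the classification objective concrete, here is a small sketch of cross-entropy with a one-hot target. It assumes natural logarithms (nats); the function name cross_entropy and the example distributions are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) * ln(q(x)), in nats.

    Hypothetical helper: returns infinity if q assigns probability 0
    to an outcome that p considers possible.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx <= 0:
                return math.inf
            total -= px * math.log(qx)
    return total

target = [0.0, 1.0, 0.0]          # one-hot ground truth: class 1 is correct
confident = [0.05, 0.90, 0.05]    # prediction that favors the true class
unsure = [0.30, 0.40, 0.30]       # prediction that spreads its probability

# The prediction that puts more mass on the true class has lower loss.
print(cross_entropy(target, confident) < cross_entropy(target, unsure))  # True

# Swapping the arguments generally changes the value.
a, b = [0.5, 0.5], [0.9, 0.1]
print(math.isclose(cross_entropy(a, b), cross_entropy(b, a)))  # False
```

With a one-hot target the sum collapses to the negative log of the probability assigned to the true class, which is why minimizing it pushes that probability toward 1.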

Example

Consider a binary classification problem, where 'p' is the true distribution of class labels and 'q' is the predicted distribution:
1. 'p' represents the true distribution of class labels:
    - p(class=0) = 0.3
    - p(class=1) = 0.7
2. 'q' represents the predicted distribution of class labels from a machine learning model:
    - q(class=0) = 0.4
    - q(class=1) = 0.6
To calculate the cross-entropy between 'p' and 'q':

H(p, q) = - [0.3 * log(0.4) + 0.7 * log(0.6)]

H(p, q) ≈ 0.632

The result, approximately 0.632 nats with the natural logarithm (about 0.912 bits with the base-2 logarithm), is the cross-entropy between the true class distribution 'p' and the predicted class distribution 'q'.

Kullback-Leibler Divergence

Definition

The KL Divergence is an asymmetric measure of how much one probability distribution p differs from a reference distribution q. Although it is often called a distance, it is not a true metric.
The relative entropy of p with respect to q is denoted and defined as:

KL(p||q) = Σ_x p(x) * log(p(x) / q(x)) = -Σ_x p(x) * log(q(x)) + Σ_x p(x) * log(p(x)) = H(p, q) - H(p)

Where the first term on the right side of the equation is the cross-entropy of p and q (the expectation of -log(q(x)) under p) and the second term is the negative of the entropy of the distribution p.
In most real-world applications, p is the actual data/measurement while q is the hypothetical distribution.
In the case of GANs, p is the probability distribution of real images while q is the probability distribution of fake images.

The double bars indicate that the function is not symmetric with respect to its arguments.

[Figure: Illustration of the relative entropy for two normal distributions. The typical asymmetry is clearly visible.]

KL Divergence measures the extra information needed, on average, when 'q' is used in place of 'p' to describe outcomes drawn from 'p'.
1. If KL Divergence is zero, it means that the two distributions are identical.
2. If it is greater than zero, it indicates that there is information lost when using 'q' to approximate 'p'.
3. It is never less than zero (Gibbs' inequality), so a negative result points to a computation error or to distributions that are not properly normalized.
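The properties of KL Divergence can be explored with a short sketch. It assumes natural logarithms (nats) and discrete distributions given as lists; the function name kl_divergence and the example values are illustrative only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * ln(p(x) / q(x)), in nats.

    Hypothetical helper: terms with p(x) == 0 are skipped, and the result
    is infinite if q(x) == 0 where p(x) > 0.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx <= 0:
                return math.inf
            total += px * math.log(px / qx)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))                                     # 0.0
print(kl_divergence(p, q) > 0)                                 # True
print(math.isclose(kl_divergence(p, q), kl_divergence(q, p)))  # False

# KL(p || q) equals cross-entropy H(p, q) minus entropy H(p).
cross = -sum(px * math.log(qx) for px, qx in zip(p, q))
ent = -sum(px * math.log(px) for px in p)
print(math.isclose(kl_divergence(p, q), cross - ent))          # True
```

The last check ties the two quantities together: because H(p) does not depend on q, minimizing cross-entropy over q is the same as minimizing KL(p||q).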

Example

Let's say you have a discrete probability distribution 'p' representing the actual distribution of the outcomes of rolling a fair six-sided die:
1. p(1) = 1/6
2. p(2) = 1/6
3. p(3) = 1/6
4. p(4) = 1/6
5. p(5) = 1/6
6. p(6) = 1/6
Now, let's consider a reference distribution 'q' that represents a biased die with the following probabilities:
1. q(1) = 1/2
2. q(2) = 1/6
3. q(3) = 1/12
4. q(4) = 1/12
5. q(5) = 1/12
6. q(6) = 1/12
To calculate the KL Divergence of 'p' with respect to 'q' (using the natural logarithm):

KL(p||q) = (1/6) * log((1/6) / (1/2)) + (1/6) * log((1/6) / (1/6)) + (1/6) * log((1/6) / (1/12)) + (1/6) * log((1/6) / (1/12)) + (1/6) * log((1/6) / (1/12)) + (1/6) * log((1/6) / (1/12))

KL(p||q) = (1/6) * [log(1/3) + log(1) + 4 * log(2)] ≈ 0.279

=> This result tells you that using the biased distribution 'q' to approximate the fair die distribution 'p' results in approximately 0.279 nats (about 0.403 bits) of information loss per roll of the die.

Jensen-Shannon Divergence

Definition

The Jensen-Shannon Divergence (JSD) is a measure of the dissimilarity between two probability distributions P and Q. It averages the KL Divergence of each distribution from their mixture M = (P + Q) / 2:

JSD(P||Q) = (1/2) * KL(P||M) + (1/2) * KL(Q||M)

Characteristics:
1. Symmetry: JSD is symmetric, meaning the divergence between distribution P and Q is the same as the divergence between Q and P. This makes it useful for comparing distributions without a clear reference or baseline.
2. Bounded: With the base-2 logarithm, JSD values always fall between 0 and 1 (between 0 and ln(2) ≈ 0.693 with the natural logarithm), with 0 indicating identical distributions and 1 indicating distributions that never assign positive probability to the same outcome. This bounded nature helps in interpretation and comparison.
3. Based on Kullback-Leibler Divergence (KLD): JSD is derived from KLD, another measure of distribution divergence. However, JSD addresses limitations of KLD by being symmetric, bounded, and always finite.
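The symmetry and boundedness listed above can be verified with a small sketch. It assumes the standard definition of JSD as the average KL Divergence of each distribution from their mixture M = (P + Q) / 2, with base-2 logarithms; the function names are illustrative and not taken from any library.

```python
import math

def kl_divergence_bits(p, q):
    """KL(p || q) in bits; terms with p(x) == 0 are skipped.

    Hypothetical helper, used here only with q = mixture of p and another
    distribution, so q(x) > 0 whenever p(x) > 0.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, via the mixture m = (p + q) / 2."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl_divergence_bits(p, m) + 0.5 * kl_divergence_bits(q, m)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(js_divergence(p, p))                                     # 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))                   # 1.0
print(math.isclose(js_divergence(p, q), js_divergence(q, p)))  # True
print(0.0 <= js_divergence(p, q) <= 1.0)                       # True
```

Because the mixture is positive wherever either input is positive, the result stays finite even when the two distributions do not overlap, unlike KL Divergence.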

