Notes of the Overview of Information Theory

EthanQuanrunWu510 · April 1, 2026, 2:47pm

The information is closely related to physics, specifically with Statistical Mechanics, which is mainly introduced by Jaynes[^1]. For my FYP I am know reading some material covering information theory [^2]. Here below will be my notes taken from the first chapter of [^2].

For a random variable X with probability distribution p(x), Informational Entropy is defined as:

H(X) = -\sum_x p(x) \log_2 p(x)

The entropy is a measure of the average uncertainty in the random variable. It represents the number of bits required on average to describe the random variable.

The benchmarks for the entropy of a discrete distribution with n possible states are:

Lower Bound: A deterministic distribution where H = -1 \log_2 1 = 0.
Upper Bound: A uniform distribution (representing a state where nothing is known about the system) where H = n \times \left(-\frac{1}{n} \log_2 \frac{1}{n}\right) = \log_2 n.

We can define Conditional Entropy H(X|Y), which is the entropy of a random variable conditional on the knowledge of another random variable. The reduction in uncertainty due to knowledge of another variable is called Mutual Information. For two random variables X and Y, this reduction is:

I(X; Y) = H(X) - H(X|Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}

Mutual information I(X; Y) is a measure of the dependence between two random variables. It is symmetric in X and Y, always non-negative, and equals zero if and only if X and Y are independent.

The key structure in this expression is the ratio \frac{p(x,y)}{p(x)p(y)}, which indicates the level of independence. The final result can be derived as:

\frac{p(x,y)}{p(x)p(y)} = \frac{p(x|y)p(y)}{p(x)p(y)} = \frac{p(x|y)}{p(x)}

When X is independent of Y, this ratio becomes 1, and I(X; Y) = 0. In this case, knowing Y provides no information to enhance our knowledge of X.

Mutual information is non-negative. It is lower-bounded by 0 when the two variables are independent and upper-bounded by H(X) when Y depends totally on X (e.g., Y=f(X)). For a Markov chain X \to Y \to Z, we have the inequality constraint:

I(X; Y) \geq I(X; Z)

This is encountered in data processing, so it is called the Data Processing Theorem (or Inequality). It corresponds to the Second Law of Thermodynamics regarding increasing entropy. (At this stage, I personally think that mutual information might be better considered as the negative of an entropy rather than being called “information,” as entropy in information theory is the mean value of self-information, while mutual information is also an expected value. This perspective aligns with the idea of entropy increasing.)

A communication channel is a system in which the output depends probabilistically on its input. For a channel with input X and output Y, we define the Capacity C as:

C = \max_{p(x)} I(X; Y)

The capacity is the maximum rate at which we can send information over the channel and recover it at the output with a vanishingly low probability of error.

Mutual information is a special case of a more general quantity called Relative Entropy D(p||q), which measures the “distance” between two probability mass functions p and q:

D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

While not a true metric, it possesses some metric-like properties. Commonly called KL Divergence, it is used in AI as an objective function to be minimized during the learning process. It characterizes the geometry of probability distributions and interprets Large Deviation Theory, which is closely related to statistical mechanics.(We have an interesting book discussing statistical mechanics with large deviation theory in our library. I will try to find it and update here.)

For completeness, I will mention the Doubling Rate W. For a stock market described by a random vector \mathbf{X} and a portfolio \mathbf{b}, it is defined as:

W = \max_{\mathbf{b}: b_i \ge 0, \sum b_i = 1} \int \log \mathbf{b}^t \mathbf{x} \, dF(\mathbf{x})

This represents the best compounding rate. Miraculously, we should fit our portfolio \mathbf{b} to the market property \mathbf{X} just as one fits the input X to the channel property Y in information theory. I personally think this has less to do with physics.

I believe this overview is sufficient to grasp the spirit of Information Theory. I may update this later with more interesting contents in the comments. For the physical part (which is mainly contributed by Jaynes), I will also try to open a new topic sooner or later.

References

[^1] Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical review , 106 (4), 620.
[^2] Cover, T. M. (1999). Elements of information theory . John Wiley & Sons.