A visual introduction to information theory

Henry Pinkard, Laura Waller

TL;DR

A visual, intuition-driven guide to key concepts in information theory, showing how entropy, mutual information, and channel capacity arise from probability and govern the limits of data compression and transmission in the presence of noise.

Abstract

Though originally developed for communications engineering, information theory provides mathematical tools with broad applications across science and engineering. These tools characterize the fundamental limits of data compression and transmission in the presence of noise. Here, we present a visual, intuition-driven guide to key concepts in information theory, showing how entropy, mutual information, and channel capacity arise from probability and govern these limits. Our presentation assumes only a familiarity with basic probability theory.

Paper Structure (44 sections, 45 equations, 18 figures)

Figures (18)

  • Figure 1: Equivalence of probability and information. a) A sequence of two marbles is drawn at random (with replacement) from an urn, giving rise to b) a probability distribution over the 16 possible two-color sequences. c) Learning that a proposition about the two colors drawn is true enables the elimination of certain outcomes. For example, learning that neither marble is blue eliminates $\frac{7}{16}$ of the possibilities, which contain $\frac{3}{4}$ of the probability mass. Eliminating probability mass, reducing uncertainty about the outcome, and gaining information are all mathematically equivalent. Eliminating 50% of the probability mass corresponds to 1 bit of information.
  • Figure 2: Entropy can be interpreted as the average length of the shortest encoding of a sequence of random events, which here are repeated draws (with replacement) of colored marbles from an urn. (Top) With equal probability of each color, the shortest binary encoding assigns a two-digit binary string to each event. The entropy is the average number of bits per event of a typical sequence: 2 bits. (Bottom) When some colors are more likely than others, the more probable ones can be recorded as shorter binary strings to save space. This gives a lower entropy: 1.75 bits.
  • Figure 3: Typical sequences. a) Example sequences of independent and identically distributed events with increasing length ($N$). b) Histograms of the information (i.e. $-\frac{\log p(x)}{N}$) of each possible sequence with length $N$. Black shows the histogram of every possible sequence. Magenta shows the distribution of probability-weighted sequences (i.e. the expected distribution one would get by taking a random sample). As $N$ increases, nearly all of the probability mass concentrates on a tiny subset of the total number of sequences: typical sequences. There are $\approx 2^{NH(\mathrm{X})}$ typical sequences, each with probability $\approx 2^{-NH(\mathrm{X})}$.
  • Figure 4: Probability, redundancy, and typicality. a) The redundancy of a random variable $\mathrm{X}$ is equal to the difference between its entropy $H(\mathrm{X})$ and the maximum possible entropy on its probability space $H_{\text{max}}(\mathcal{X})$. b) Distributions with more concentrated probability mass have higher redundancy. (Top) The equal probability case, (Bottom) the concentrated probability case. (Left) Probability distribution over a single event of an independent and identically distributed sequence, (Middle) a typical sequence of events from this distribution. (Right) The entropy, redundancy and maximum entropy.
  • Figure 5: Mutual information describes the relationship between two random variables. Here those random variables are the shape and color of an object drawn at random. The joint distribution of shape and color determines the amount of mutual information. (Top row) 2 bits of mutual information, (middle row) 1 bit of mutual information, (bottom row) 0 bits of mutual information. a) The joint distribution of shape and color, with uniform probability over all possible shape/color combinations shown. b) Mapping view showing the colors, possible shape/color combinations, possible shapes, and the possibilities for colors that can be inferred from shape alone. Line thickness shows strength of the relationship. c) Compact view that omits the joint distribution and color inference. d) More of the entropy of the two events is shared with greater mutual information.
  • ...and 13 more figures
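
The quantities quoted in the captions for Figures 1, 2, 4, and 5 can be checked with a few lines of Python. This is a minimal sketch and not code from the paper. The color probabilities (1/2, 1/4, 1/8, 1/8, with blue taken to be the most likely color) and the three 4x4 shape/color joint distributions are assumptions, chosen because they reproduce the values the captions state. All function and variable names are hypothetical.

```python
import math

def information_bits(remaining_mass):
    """Bits gained when the outcomes still possible hold `remaining_mass` of the probability."""
    return -math.log2(remaining_mass)

def entropy_bits(probs):
    """Shannon entropy H = -sum p*log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information_bits(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint distribution given as rows of probabilities."""
    p_x = [sum(row) for row in joint]            # marginal over rows
    p_y = [sum(col) for col in zip(*joint)]      # marginal over columns
    p_xy = [p for row in joint for p in row]     # flattened joint
    return entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(p_xy)

# Assumed urn distributions (not stated in the captions).
uniform = [1/4, 1/4, 1/4, 1/4]
skewed = [1/2, 1/4, 1/8, 1/8]   # first entry assumed to be blue

# Figure 1: "neither marble is blue" leaves (1 - 1/2)^2 = 1/4 of the mass.
bits_no_blue = information_bits((1 - skewed[0]) ** 2)
bits_half = information_bits(0.5)

# Figure 2: entropy per draw for the two urns.
h_uniform = entropy_bits(uniform)
h_skewed = entropy_bits(skewed)

# Figure 4: redundancy is the gap to the maximum entropy on the same outcomes.
redundancy = h_uniform - h_skewed

# Figure 5: assumed joint distributions over 4 shapes (rows) and 4 colors (columns).
one_to_one = [[1/4 if i == j else 0 for j in range(4)] for i in range(4)]
two_blocks = [[1/8 if i // 2 == j // 2 else 0 for j in range(4)] for i in range(4)]
independent = [[1/16] * 4 for _ in range(4)]

mi_top = mutual_information_bits(one_to_one)
mi_middle = mutual_information_bits(two_blocks)
mi_bottom = mutual_information_bits(independent)

print(bits_no_blue, bits_half, h_uniform, h_skewed, redundancy)
print(mi_top, mi_middle, mi_bottom)
```

Under these assumptions, eliminating three quarters of the probability mass yields 2 bits, the two urns have entropies of 2 and 1.75 bits (a redundancy of 0.25 bits), and the three joint distributions carry 2, 1, and 0 bits of mutual information, in line with the captions.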