On the Structure of Information

Sebastian Gottwald, Daniel A. Braun

TL;DR

Characterizing information as risk reduction between partitions of the underlying probability space shows how information decomposes into redundant, unique, and synergistic contributions. This question matters in applications from neuroscience to machine learning, yet existing formulations lack consensus on foundational definitions and can violate basic properties such as the chain rule or non-negativity.

Abstract

We characterize information as risk reduction between knowledge states represented by partitions of the underlying probability space. Entropy corresponds to risk reduction from no (or partial) knowledge to full knowledge about a random variable, while information corresponds to risk reduction from no (or partial) knowledge to partial knowledge. This applies to any information measure that is based on expected loss minimization, such as Bregman information, with Shannon information and variance as prominent examples. In each case, fundamental properties like the chain rule, non-negativity, and the relationship between information and divergence are preserved. Because partitions form a lattice under refinement, our general treatment reveals how information can be decomposed into redundant, unique, and synergistic contributions, a question important in applications from neuroscience to machine learning, yet one for which existing formulations lack consensus on foundational definitions and can violate basic properties such as the chain rule or non-negativity. Redundancy corresponds to Aumann's common knowledge, synergy to the gap between separately and jointly observed sources, and unique information is necessarily path-dependent, taking different values depending on what is already known. The resulting partial information decomposition is grounded directly in probability theory, avoids treating scalar information quantities as primitive compositional objects, and yields non-negative terms by construction.
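
The risk-reduction view described in the abstract can be illustrated with a minimal sketch. It assumes a finite sample space, blocks of positive probability, and two losses (log loss and squared error); the function names `risk_log_loss` and `risk_sq_error`, the uniform distribution, and the partition `partial` are illustrative choices, not taken from the paper. The variable $X(\omega)=\lfloor(\omega-1)/2\rfloor$ on $\Omega=\{1,\dots,6\}$ is borrowed from the caption of Figure 2.

```python
import math

def risk_log_loss(p, x, partition):
    """Expected log loss (bits) when the best distribution over X is chosen per block."""
    total = 0.0
    for block in partition:
        p_block = sum(p[w] for w in block)  # assumed to be positive
        cond = {}  # conditional distribution of X given the block
        for w in block:
            cond[x[w]] = cond.get(x[w], 0.0) + p[w] / p_block
        total += p_block * -sum(q * math.log2(q) for q in cond.values() if q > 0)
    return total

def risk_sq_error(p, x, partition):
    """Expected squared error when the best point estimate (the block mean) is chosen per block."""
    total = 0.0
    for block in partition:
        p_block = sum(p[w] for w in block)  # assumed to be positive
        mean = sum(p[w] * x[w] for w in block) / p_block
        total += sum(p[w] * (x[w] - mean) ** 2 for w in block)
    return total

# Hypothetical setup: uniform distribution on a six-element sample space.
omega = range(1, 7)
p = {w: 1 / 6 for w in omega}
x = {w: (w - 1) // 2 for w in omega}

none = [list(omega)]               # no knowledge: a single block
partial = [[1, 2, 3], [4, 5, 6]]   # partial knowledge (illustrative partition)
full = [[1, 2], [3, 4], [5, 6]]    # full knowledge about X: the partition generated by X

for risk in (risk_log_loss, risk_sq_error):
    entropy = risk(p, x, none) - risk(p, x, full)        # no knowledge -> full knowledge
    info = risk(p, x, none) - risk(p, x, partial)        # no knowledge -> partial knowledge
    remaining = risk(p, x, partial) - risk(p, x, full)   # partial knowledge -> full knowledge
    print(risk.__name__, entropy, info, remaining)
```

For both losses the three printed quantities satisfy `entropy = info + remaining`, which is the telescoping form of the chain rule along the chain of partitions `none`, `partial`, `full`. With log loss the first quantity is the Shannon entropy of $X$; with squared error it is the variance of $X$.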

Paper Structure

This paper contains 16 sections, 3 theorems, 57 equations, and 8 figures.

Key Result

Proposition 1

For any random variable $X:\Omega\to\mathcal{X}$, we have:

Figures (8)

  • Figure 1: Examples of partitions $\pi$ and $\pi'$ of $\Omega = [0,1]$, including their coarsest common refinement (join $\pi\vee \pi'$) and finest common coarsening (meet $\pi\wedge\pi'$). The blocks of $\pi\vee \pi'$ are knowable from the combination of $\pi$ and $\pi'$, while the blocks of $\pi\wedge \pi'$ are sets of common knowledge, because these sets are knowable by both $\pi$ and $\pi'$. In the terminology of Aumann (1976), a decision-maker who performs measurement $\pi$ knows the block of $\pi\wedge \pi'$ that contains $\omega$ and also knows that a decision-maker with measurement $\pi'$ knows that $\omega$ is in that block. They also know that the other decision-maker knows that they know, and so on.
  • Figure 2: A small section of the knowledge lattice of a $6$-element sample space $\Omega=\{1,\dots,6\}$. Partitions connected through a path are comparable with respect to $\preceq$, with upper ones being finer than lower ones. Note that $\pi$ and $\pi'$ are not comparable with respect to $\preceq$, while both are bounded by their meet and join. For example, $\pi$ could be the partition corresponding to the random variable $X$ given by $X(\omega)=\lfloor (\omega -1)/2 \rfloor$ (integer division of $\omega - 1$ by 2).
  • Figure 3: Schematic illustration of how specifying a loss projects the knowledge lattice on $\Omega$ to the real line of risks, giving rise to informational quantities, such as entropy $H(X)$, conditional entropy $H(X|\pi)$ and information $I(X;\pi)$, through risk reduction from one knowledge state to another.
  • Figure 4: Partial information decomposition for three variables. When traversing the knowledge lattice along the shown path, the total information $I(X;(Y,Z))$ decomposes into four terms, corresponding to redundant, unique, conditional unique, and synergistic information.
  • Figure 5: Knowledge lattices for the three most basic examples of three variables $X,Y$ and $Z$, each example focussing on a different type of information: $\mathsf{UNQ}$ only contains unique information, $\mathsf{RDN}$ only redundant information, and $\mathsf{XOR}$ only contains synergistic information. The nodes, each representing a partition of $\Omega = \{(0,0),(0,1),(1,0),(1,1)\}$, are visualized using different textures for different blocks of the corresponding partition. Next to the partitions we display the associated distributions over $X$. The non-zero risk reduction terms are calculated for the log loss, mean squared error and zero-one loss, discussed in the examples appendix.
  • ...and 3 more figures
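
The join and meet described in the captions of Figures 1 and 2 can be computed directly for finite partitions. The following is a minimal sketch: the helper names (`partition_from`, `refines`, `join`, `meet`) and the second partition `pi2` are hypothetical, since the captions do not specify $\pi'$. Only $\pi$, generated by $X(\omega)=\lfloor(\omega-1)/2\rfloor$, comes from the Figure 2 caption.

```python
def partition_from(omega, f):
    """Partition of omega generated by a function f (blocks are the level sets of f)."""
    blocks = {}
    for w in omega:
        blocks.setdefault(f(w), set()).add(w)
    return {frozenset(b) for b in blocks.values()}

def refines(fine, coarse):
    """True if every block of `fine` lies inside some block of `coarse`."""
    return all(any(a <= b for b in coarse) for a in fine)

def join(pi, pi2):
    """Coarsest common refinement: all non-empty pairwise intersections of blocks."""
    return {a & b for a in pi for b in pi2 if a & b}

def meet(pi, pi2):
    """Finest common coarsening: merge blocks of pi that are linked through blocks of pi2."""
    blocks = [set(b) for b in pi]
    for b in pi2:
        overlapping = [c for c in blocks if c & b]
        merged = set(b).union(*overlapping)
        blocks = [c for c in blocks if not (c & b)] + [merged]
    return {frozenset(b) for b in blocks}

omega = range(1, 7)
pi = partition_from(omega, lambda w: (w - 1) // 2)        # {1,2}, {3,4}, {5,6}
pi2 = {frozenset({1, 2, 3, 4}), frozenset({5}), frozenset({6})}  # hypothetical pi'

print(refines(pi, pi2), refines(pi2, pi))   # neither partition refines the other
print(sorted(map(sorted, join(pi, pi2))))
print(sorted(map(sorted, meet(pi, pi2))))
```

In this example the two partitions are not comparable, yet the join refines both and both refine the meet, matching the description of the lattice section in Figure 2. The blocks of the meet are the sets that both observers can identify, which is the sense in which the Figure 1 caption calls them common knowledge.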

Theorems & Definitions (13)

  • Definition 1: Knowledge lattice I
  • Definition 2: Knowledge lattice II
  • Definition 3: Risk
  • Definition 4: Uncertainty, entropy, and information
  • Proposition 1
  • Proposition 2
  • Definition 5: Redundancy, uniqueness, and synergy
  • Proposition 3: Partial information decomposition
  • Example 1: Square error
  • Example 2: Log loss
  • ...and 3 more
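
The $\mathsf{XOR}$ example from Figure 5 can be worked through as a sketch of how risk reductions telescope along a chain in the knowledge lattice. Several things here are assumptions rather than the paper's definitions: a uniform distribution on $\Omega$, $Y$ and $Z$ read off as the two coordinates with $X = Y \oplus Z$, log loss only, and a simplified three-step chain (no knowledge, meet of the $Y$ and $Z$ partitions, the $Y$ partition, join). The paper's Definition 5 and the four-term path of Figure 4 are not reproduced here.

```python
import math

omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
p = {w: 0.25 for w in omega}          # assumed uniform distribution
x = {w: w[0] ^ w[1] for w in omega}   # X = Y xor Z, with Y = w[0], Z = w[1]

def cond_entropy(partition):
    """Log-loss risk (bits): entropy of X within each block, averaged over blocks."""
    total = 0.0
    for block in partition:
        p_block = sum(p[w] for w in block)
        cond = {}
        for w in block:
            cond[x[w]] = cond.get(x[w], 0.0) + p[w] / p_block
        total -= p_block * sum(q * math.log2(q) for q in cond.values())
    return total

bottom = [omega]                                   # no knowledge
meet_yz = [omega]                                  # meet of the Y and Z partitions is trivial here
pi_y = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]        # observing Y only
join_yz = [[w] for w in omega]                     # observing Y and Z jointly

chain = [bottom, meet_yz, pi_y, join_yz]
risks = [cond_entropy(pi) for pi in chain]
steps = [risks[i] - risks[i + 1] for i in range(len(chain) - 1)]
print(steps, sum(steps))
```

Under these assumptions the first two steps contribute nothing and the entire bit of information appears only in the final step to the join, consistent with the caption's statement that $\mathsf{XOR}$ contains only synergistic information. Each step is a risk reduction between a coarser and a finer partition, so every term is non-negative and the steps sum to the total reduction from `bottom` to `join_yz`.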