A Hierarchical Decomposition of Kullback-Leibler Divergence: Disentangling Marginal Mismatches from Statistical Dependencies

William Cook

TL;DR

This work addresses the difficulty of interpreting KL divergence in high-dimensional settings by deriving an exact decomposition of $\mathrm{KL}(P_k \| Q^{\otimes k})$ into a sum of marginal divergences and a hierarchical total correlation component. Using Shannon entropy and Möbius inversion on the subset lattice, it expresses the total correlation as a sum of total $r$-way interaction informations $I^{(r)}(P_k)$ for $r \ge 2$, yielding $\mathrm{KL}(P_k \| Q^{\otimes k}) = \sum_{i=1}^k \mathrm{KL}(P_i \| Q) + \sum_{r=2}^k I^{(r)}(P_k)$. The paper validates the decomposition numerically with multivariate hypergeometric models, showing agreement to machine precision and showing how the total divides between marginal and interaction contributions. Beyond its theoretical appeal, the framework offers a practical diagnostic for machine learning, biology, neuroscience, and economics: it identifies whether a divergence stems from univariate misfits or from dependencies among variables. The paper also discusses extensions to general product references, continuous variables, and computational considerations, and outlines future directions for scalable estimation and application.

Abstract

The Kullback-Leibler (KL) divergence is a foundational measure for comparing probability distributions. Yet in multivariate settings, its single value often obscures the underlying reasons for divergence, conflating mismatches in individual variable distributions (marginals) with effects arising from statistical dependencies. We derive an algebraically exact, additive, and hierarchical decomposition of the KL divergence between a joint distribution $P(X_1,\dots,X_n)$ and a standard product reference distribution $Q(X_1,\dots,X_n) = \prod_{i=1}^{n} q(X_i)$, where variables are assumed independent and identically distributed according to a common reference $q$. The total divergence precisely splits into two primary components: (1) the summed divergence of each marginal distribution $P_i(X_i)$ from the common reference $q(X_i)$, quantifying marginal deviations; and (2) the total correlation (or multi-information), capturing the total statistical dependency among variables. Leveraging Möbius inversion on the subset lattice, we further decompose this total correlation term into a hierarchy of signed contributions from distinct pairwise, triplet, and higher-order statistical interactions, expressed using standard Shannon information quantities. This decomposition provides an algebraically complete and interpretable breakdown of KL divergence using established information measures, requiring no approximations or model assumptions. Numerical validation using hypergeometric sampling confirms the decomposition's exactness to machine precision across diverse system configurations.
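The first-level split described in the abstract (summed marginal divergences plus total correlation) can be checked numerically on a small discrete example. The following is a minimal sketch, not the paper's validation code: it uses a random joint distribution rather than the hypergeometric models the abstract mentions, and the helper names (`kl_bits`, `entropy_bits`, `marginal`, `product_reference`) and the sizes `k` and `m` are hypothetical choices made for illustration.

```python
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits for two arrays of the same shape (0 log 0 treated as 0)."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def entropy_bits(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def marginal(joint, axis):
    """Marginal distribution of the variable stored along `axis`."""
    other = tuple(a for a in range(joint.ndim) if a != axis)
    return joint.sum(axis=other)

def product_reference(q, k):
    """k-fold product reference built from a common one-variable reference q."""
    ref = np.asarray(q, dtype=float)
    for _ in range(k - 1):
        ref = np.multiply.outer(ref, q)
    return ref

# Assumed toy setup: k = 3 variables, each with m = 4 states.
rng = np.random.default_rng(0)
k, m = 3, 4
joint = rng.random((m,) * k)
joint /= joint.sum()
q = rng.random(m)
q /= q.sum()

# Left-hand side: KL between the joint and the i.i.d. product reference.
total_kl = kl_bits(joint, product_reference(q, k))

# Right-hand side: summed marginal KLs plus total correlation.
marginals = [marginal(joint, i) for i in range(k)]
marginal_kl = sum(kl_bits(p_i, q) for p_i in marginals)
total_corr = sum(entropy_bits(p_i) for p_i in marginals) - entropy_bits(joint)

residual = total_kl - (marginal_kl + total_corr)
print(f"total KL = {total_kl:.6f} bits, residual = {residual:.2e}")
```

Because the identity is algebraic, the residual should sit at floating-point round-off for any strictly positive reference `q`; the random joint here only exercises the bookkeeping and says nothing about the paper's specific test cases.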

Paper Structure

This paper contains 17 sections, 2 theorems, 31 equations, 3 figures.

Key Result

Lemma 2.8

The sum of the total interaction information terms of order 2 and higher, $\sum_{r=2}^{k} I^{(r)}(P_k)$, equals the total correlation of $P_k$.
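The lemma can be illustrated with a short Möbius-inversion sketch. It assumes one particular sign convention, in which the contribution of a subset $S$ is $\sum_{T \subseteq S} (-1)^{|S|-|T|+1} H(T)$, so that pairwise terms reduce to ordinary mutual information; the paper's own convention (see its sign convention remarks) may differ, and the function names below are hypothetical.

```python
from itertools import combinations
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def subset_entropy(joint, subset):
    """Entropy of the marginal over the variables (axes) listed in `subset`."""
    drop = tuple(a for a in range(joint.ndim) if a not in subset)
    return entropy_bits(joint.sum(axis=drop) if drop else joint)

def interaction(joint, subset):
    """Mobius-inverted term for `subset` under the assumed sign convention:
    sum over non-empty T within S of (-1)^(|S|-|T|+1) * H(T)."""
    s = len(subset)
    total = 0.0
    for t in range(1, s + 1):  # the empty set contributes H = 0
        for sub in combinations(subset, t):
            total += (-1) ** (s - t + 1) * subset_entropy(joint, sub)
    return total

def order_totals(joint):
    """Total r-way interaction for each order r = 2..k, summed over all subsets of size r."""
    k = joint.ndim
    return {
        r: sum(interaction(joint, s) for s in combinations(range(k), r))
        for r in range(2, k + 1)
    }

# Assumed toy setup: k = 4 binary variables with a random joint distribution.
rng = np.random.default_rng(1)
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()

totals = order_totals(joint)
total_corr = (
    sum(subset_entropy(joint, (i,)) for i in range(joint.ndim)) - entropy_bits(joint)
)
gap = sum(totals.values()) - total_corr
print(totals, f"gap = {gap:.2e}")
```

Terms of order 3 and above can be negative under this convention, which is why the hierarchy is described as signed; only the sum over all orders is guaranteed to match the non-negative total correlation.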

Figures (3)

  • Figure 1: Panel A: Hierarchical decomposition for Case 1 ($k=3$, symmetric).
  • Figure 2: Panel B: Boolean lattice illustration for Case 1.
  • Figure 3: Panel C: Empirical validation via stacked bar chart for three test cases (Case 1: $k=3$, symmetric; Case 2: $k=2$, asymmetric; Case 4: $k=4$, symmetric). Bars show the components of the recomposed KL divergence based on Theorem 2.9: sum of marginal KLs $\sum \mathrm{KL}(P_i \| Q)$ (rust red, negligible in these cases as $P_i \approx Q$), total pairwise interaction $I^{(2)}$ (blue), total three-way interaction $I^{(3)}$ (purple), and total four-way interaction $I^{(4)}$ (teal, Case 4 only). The black line marks the independently computed total $\mathrm{KL}(P_k \| Q^{\otimes k})$. The near-perfect match between the stacked bars and the black line (residuals $<10^{-15}$ bits; see the validation summary appendix) numerically validates the exactness of the decomposition theorem across different system sizes and symmetries.

Theorems & Definitions (12)

  • Definition 2.2: Interaction Information
  • Remark 2.3: Sign Convention Warning
  • Remark 2.4: Sign Convention and Examples
  • Definition 2.5: Total $r$-way Interaction Information
  • Definition 2.6: Total Correlation
  • Definition 2.7: Kullback-Leibler Divergence
  • Lemma 2.8: Total Correlation from Interaction Information
  • proof
  • Theorem 2.9: Hierarchical KL Decomposition
  • proof
  • ...and 2 more