Surprisal-Rényi Free Energy

Shion Matsumoto, Raul Castillo, Benjamin Prada, Ankur Arjun Mali

TL;DR

This work introduces the Surprisal-Rényi Free Energy (SRFE), a log-moment-based functional of the likelihood ratio that lies outside the class of $f$-divergences. It identifies SRFE as a variance- and tail-sensitive free-energy functional that clarifies the geometric and large-deviation structure underlying the forward and reverse KL limits, without unifying or subsuming distinct learning frameworks.

Abstract

The forward and reverse Kullback-Leibler (KL) divergences arise as limiting objectives in learning and inference yet induce markedly different inductive biases that cannot be explained at the level of expectations alone. In this work, we introduce the Surprisal-Rényi Free Energy (SRFE), a log-moment-based functional of the likelihood ratio that lies outside the class of $f$-divergences. We show that SRFE recovers forward and reverse KL divergences as singular endpoint limits and derive local expansions around both limits in which the variance of the log-likelihood ratio appears as a first-order correction. This reveals an explicit mean-variance tradeoff governing departures from KL-dominated regimes. We further establish a Gibbs-type variational characterization of SRFE as the unique minimizer of a weighted sum of KL divergences and prove that SRFE directly controls large deviations of excess code-length via Chernoff-type bounds, yielding a precise Minimum Description Length interpretation. Together, these results identify SRFE as a variance- and tail-sensitive free-energy functional that clarifies the geometric and large-deviation structure underlying forward and reverse KL limits, without unifying or subsuming distinct learning frameworks.
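
The endpoint limits and the first-order variance correction described in the abstract can be illustrated numerically. The paper's definition of SRFE is not reproduced on this page, so the sketch below assumes a generic log-moment functional, $\log \mathbb{E}_Q[(p/q)^\tau] / (\tau(\tau-1))$, which is a rescaled Rényi divergence and may differ from the authors' SRFE in normalization or parameterization. The distributions `p` and `q`, the function names, and the first-order formula (obtained here by Taylor-expanding this assumed form, not quoted from the paper) are all illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def log_moment_functional(p, q, tau):
    """ASSUMED form, not taken from the paper:
    log E_Q[(p/q)^tau] / (tau * (tau - 1)), for tau strictly between 0 and 1."""
    moment = sum(qi * (pi / qi) ** tau for pi, qi in zip(p, q))
    return math.log(moment) / (tau * (tau - 1.0))

def surprisal_variance(p, q):
    """Variance under P of the log-likelihood ratio log(p/q)."""
    mean = kl(p, q)
    return sum(pi * (math.log(pi / qi) - mean) ** 2 for pi, qi in zip(p, q))

# Illustrative distributions on three outcomes (hypothetical values).
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

forward_kl = kl(p, q)
reverse_kl = kl(q, p)

# Endpoint behaviour: tau -> 1 approaches KL(P||Q), tau -> 0 approaches KL(Q||P).
near_forward = log_moment_functional(p, q, 1.0 - 1e-4)
near_reverse = log_moment_functional(p, q, 1e-4)

# First-order expansion of the assumed form around tau = 1, with s = tau - 1:
#   F(tau) ~ KL(P||Q) + s * (Var_P[log(p/q)] / 2 - KL(P||Q)).
tau = 0.99
s = tau - 1.0
first_order = forward_kl + s * (0.5 * surprisal_variance(p, q) - forward_kl)
exact = log_moment_functional(p, q, tau)

print("forward KL:", forward_kl, " functional near tau=1:", near_forward)
print("reverse KL:", reverse_kl, " functional near tau=0:", near_reverse)
print("tau=0.99 exact:", exact, " first-order prediction:", first_order)
```

In this sketch the variance of the log-likelihood ratio enters the slope of the functional at the forward-KL endpoint, which is the kind of mean-variance tradeoff the abstract refers to.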

Paper Structure (59 sections, 37 theorems, 202 equations, 7 figures, 6 tables, 7 algorithms)

Key Result

Theorem 2.3

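Let $P$ and $Q$ be probability measures on a countable space. If $P \ll Q$, then the Cressie-Read power divergence converges to the forward KL divergence $\mathrm{KL}(P \,\|\, Q)$ in the corresponding endpoint limit of its order parameter. If, in addition, $Q \ll P$, then it converges to the reverse KL divergence $\mathrm{KL}(Q \,\|\, P)$ in the opposite endpoint limit.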
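
The KL limits of the Cressie-Read family can be checked numerically. The sketch below uses the standard textbook parameterization, $\frac{1}{\lambda(\lambda+1)} \sum_x p(x)\left[(p(x)/q(x))^\lambda - 1\right]$, in which $\lambda \to 0$ gives $\mathrm{KL}(P \,\|\, Q)$ and $\lambda \to -1$ gives $\mathrm{KL}(Q \,\|\, P)$. The paper's Definition 2.2 may index the family differently (the figure captions refer to an order $\tau$), so the parameter `lam`, the function names, and the example distributions are assumptions made for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cressie_read(p, q, lam):
    """Standard Cressie-Read power divergence for lam not in {0, -1}.
    expm1 computes (p/q)^lam - 1 accurately when lam is close to 0."""
    total = sum(pi * math.expm1(lam * math.log(pi / qi)) for pi, qi in zip(p, q))
    return total / (lam * (lam + 1.0))

# Illustrative distributions with full support, so P << Q and Q << P both hold.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

cr_near_zero = cressie_read(p, q, 1e-4)             # close to KL(P || Q)
cr_near_minus_one = cressie_read(p, q, -1.0 + 1e-4)  # close to KL(Q || P)

print("KL(P||Q):", kl(p, q), " CR at lam=1e-4:", cr_near_zero)
print("KL(Q||P):", kl(q, p), " CR at lam=-1+1e-4:", cr_near_minus_one)
```

Both distributions have full support here, so the absolute-continuity conditions of the theorem are satisfied trivially; with zeros in `q` or `p` the corresponding limit would be infinite.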

Figures (7)

  • Figure 1: Gaussian $Q$ that minimizes forward KL (blue) and reverse KL (red) against a mixture of Gaussians $P$ with means $\mu_1, \mu_2$, where $\mu_2 \ge \mu_1$, and variances $\sigma_1^2=\sigma_2^2=1$. The mixture components are equally weighted.
  • Figure 2: SRFE updates are driven by the score $\nabla_\theta\log q_\theta$ evaluated under the escort $r_\tau$ (which downweights regions where $q_\theta$ is small when $\tau\in(0,1)$), whereas CR updates are driven by samples from $q_\theta$ with an explicit likelihood-ratio weight $u^\tau$ that can amplify variance when $q_\theta\ll p$.
  • Figure 3: Experiment 1 - For $\tau\in\{0.3, 0.5, 0.7, 0.9\}$ the fit tends to spread out, matching forward KL performance; $\tau = 0.1$ and reverse KL home in on fewer modes.
  • Figure 4: Experiment 2 - $\tau$-sweep, with the $\tau$ that is optimal for mode coverage, entropy error, effective sample size, and test log-likelihood shown in red.
  • Figure 5: Experiment 3 - The fixed $\tau=0.99$ schedule is unstable compared with the other schedules.
  • ...and 2 more figures
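
The contrast described in the Figure 1 caption can be reproduced with a small sketch. For a Gaussian family, the forward-KL minimizer matches the mean and variance of $P$, so it has a closed form; the reverse-KL minimizer is found here by a coarse grid search over a numerically integrated objective. The component means ($\pm 4$), the integration grid, and the candidate grid are illustrative assumptions and are not the settings used in the paper, and the grid search stands in for whatever optimizer the authors used.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Equally weighted two-component mixture P with unit variances (assumed means).
MU1, MU2 = -4.0, 4.0
STEP = 0.05
XS = [-14.0 + STEP * i for i in range(561)]  # integration grid on [-14, 14]
LOG_P = [math.log(0.5 * normal_pdf(x, MU1, 1.0) + 0.5 * normal_pdf(x, MU2, 1.0))
         for x in XS]

def reverse_kl(mean, std):
    """Riemann-sum estimate of KL(Q || P) for Q = N(mean, std^2)."""
    total = 0.0
    for x, log_p in zip(XS, LOG_P):
        qx = normal_pdf(x, mean, std)
        if qx > 0.0:  # skip points where q underflows to zero
            total += qx * (math.log(qx) - log_p)
    return total * STEP

# Forward KL, min over Gaussian Q of KL(P || Q): moment matching.
fwd_mean = 0.5 * MU1 + 0.5 * MU2
fwd_var = 0.5 * (1.0 + MU1 ** 2) + 0.5 * (1.0 + MU2 ** 2) - fwd_mean ** 2

# Reverse KL, min over Gaussian Q of KL(Q || P): coarse grid search.
candidates = [(-5.0 + 0.5 * i, 0.5 + 0.25 * j)
              for i in range(21) for j in range(19)]
rev_mean, rev_std = min(candidates, key=lambda c: reverse_kl(*c))

print("forward KL fit: mean", fwd_mean, "variance", fwd_var)
print("reverse KL fit: mean", rev_mean, "std", rev_std)
```

With well-separated components, the forward-KL fit is a single broad Gaussian centered between the modes, while the reverse-KL fit settles on one mode with roughly unit variance. Which of the two symmetric modes the grid search returns depends on floating-point tie-breaking.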

Theorems & Definitions (60)

  • Definition 2.1: $f$-divergence
  • Definition 2.2: Cressie-Read power divergence
  • Theorem 2.3: CR limits to KL divergences
  • Definition 3.0: SRFE and associated CR
  • Theorem 3.1: KL limits
  • Lemma 3.1: Nonnegativity
  • Lemma 3.1: Monotone equivalence with CR
  • Theorem 3.2: SRFE is not an $f$-divergence
  • Theorem 3.3: Surprisal-variance expansion for standard CR
  • Theorem 3.4: SRFE local expansion around forward KL
  • ...and 50 more