Surprisal-Rényi Free Energy

Shion Matsumoto; Raul Castillo; Benjamin Prada; Ankur Arjun Mali

TL;DR

This work introduces the Surprisal-Rényi Free Energy (SRFE), a log-moment-based functional of the likelihood ratio that lies outside the class of $f$-divergences, and identifies SRFE as a variance- and tail-sensitive free-energy functional that clarifies the geometric and large-deviation structure underlying forward and reverse KL limits, without unifying or subsuming distinct learning frameworks.

Abstract

The forward and reverse Kullback-Leibler (KL) divergences arise as limiting objectives in learning and inference yet induce markedly different inductive biases that cannot be explained at the level of expectations alone. In this work, we introduce the Surprisal-Rényi Free Energy (SRFE), a log-moment-based functional of the likelihood ratio that lies outside the class of $f$-divergences. We show that SRFE recovers forward and reverse KL divergences as singular endpoint limits and derive local expansions around both limits in which the variance of the log-likelihood ratio appears as a first-order correction. This reveals an explicit mean-variance tradeoff governing departures from KL-dominated regimes. We further establish a Gibbs-type variational characterization of SRFE as the unique minimizer of a weighted sum of KL divergences and prove that SRFE directly controls large deviations of excess code-length via Chernoff-type bounds, yielding a precise Minimum Description Length interpretation. Together, these results identify SRFE as a variance- and tail-sensitive free-energy functional that clarifies the geometric and large-deviation structure underlying forward and reverse KL limits, without unifying or subsuming distinct learning frameworks.
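The claim that a log-moment functional of the likelihood ratio has a KL divergence as an endpoint limit, with the variance of the log-likelihood ratio as the first-order correction, can be checked numerically on a toy example. The sketch below does not use the paper's SRFE definition, which is not reproduced here. It assumes a generic Rényi-type log-moment functional, $-\tfrac{1}{\tau}\log \mathbb{E}_P[(q/p)^\tau]$, and compares it with the standard cumulant expansion $\mathrm{KL}(P\|Q) - \tfrac{\tau}{2}\mathrm{Var}_P[\log(p/q)]$. The function names and the two discrete distributions are illustrative choices.

```python
import math


def kl(p, q):
    """KL(P || Q) in nats for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def surprisal_variance(p, q):
    """Variance under P of the log-likelihood ratio log(p/q)."""
    mean = kl(p, q)
    return sum(pi * (math.log(pi / qi) - mean) ** 2 for pi, qi in zip(p, q))


def log_moment(p, q, tau):
    """Generic log-moment functional -(1/tau) * log E_P[(q/p)^tau].

    Assumption: a stand-in for illustration, not the paper's SRFE.
    """
    moment = sum(pi * (qi / pi) ** tau for pi, qi in zip(p, q))
    return -math.log(moment) / tau


# Illustrative distributions on three outcomes (hypothetical values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

for tau in (0.1, 0.01, 0.001):
    exact = log_moment(p, q, tau)
    approx = kl(p, q) - 0.5 * tau * surprisal_variance(p, q)
    print(f"tau={tau}: exact={exact:.6f}  KL - (tau/2)*Var={approx:.6f}")
```

As $\tau \to 0$ the two columns agree to higher and higher precision, which is the mean-variance structure described above in miniature: the mean of the log-likelihood ratio gives the KL term and its variance gives the leading departure from it.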

Paper Structure

This paper contains 59 sections, 37 theorems, 202 equations, 7 figures, 6 tables, 7 algorithms.

Introduction
Contribution
Preliminaries
Notation.
$f$-divergences
Cressie--Read power divergence family
Motivation for going beyond CR
Surprisal-Rényi Free Energy
Almost disjoint supports
Basic Properties
Second-Order Surprisal Analysis
Interpretation
Gradient Dynamics
Optimization advantages of SRFE
$\tau$-scheduling
...and 44 more sections

Key Result

Theorem 2.3

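Let $P$ and $Q$ be probability measures on a countable space. If $P \ll Q$, then the Cressie--Read power divergence recovers $\mathrm{KL}(P\|Q)$ as a limit of its order parameter. If, in addition, $Q \ll P$, then the opposite endpoint limit recovers $\mathrm{KL}(Q\|P)$.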

Figures (7)

Figure 1: Gaussian $Q$ that minimizes forward KL (blue) and reverse KL (red) with a mixture of Gaussians $P$ with means $\mu_1, \mu_2$ where $\mu_2 \ge \mu_1$ and variance $\sigma_1^2=\sigma_2^2=1$. Gaussians are equally weighted.
Figure 2: SRFE updates are driven by the score $\nabla_\theta\log q_\theta$ evaluated under the escort $r_\tau$ (which downweights regions where $q_\theta$ is small when $\tau\in(0,1)$), whereas CR updates are driven by samples from $q_\theta$ with an explicit likelihood-ratio weight $u^\tau$ that can amplify variance when $q_\theta\ll p$.
Figure 3: Experiment 1 - $\tau\in\{0.3, 0.5, 0.7, 0.9\}$ tend to spread out, matching forward KL performance; $\tau = 0.1$ and reverse KL home in on fewer modes.
Figure 4: Experiment 2 - $\tau$-sweep for $\tau$ with optimal mode coverage, entropy error, effective sample size, and test log-likelihood (in red).
Figure 5: Experiment 3 - The fixed $\tau=0.99$ schedule is unstable compared with the other schedules.
...and 2 more figures
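
The forward-KL fit in Figure 1 has a closed form that is easy to reproduce. Assuming "forward KL" means $\mathrm{KL}(P\|Q)$ with $Q$ ranging over Gaussians, the minimizer matches the mean and variance of the mixture $P$ (a standard moment-matching fact, not a result specific to this paper). The sketch below computes that Gaussian for a two-component mixture; the function name and the example means are hypothetical.

```python
def forward_kl_gaussian_fit(mu1, mu2, sigma2=1.0, w=0.5):
    """Mean and variance of the Gaussian Q minimizing KL(P || Q).

    P is the mixture w * N(mu1, sigma2) + (1 - w) * N(mu2, sigma2).
    The minimizer over Gaussians matches the first two moments of P.
    """
    mean = w * mu1 + (1.0 - w) * mu2
    # Mixture variance = within-component variance + between-component spread.
    var = sigma2 + w * (1.0 - w) * (mu2 - mu1) ** 2
    return mean, var


# Equally weighted components with unit variance, as in the caption;
# the means -2 and 2 are illustrative values.
mean, var = forward_kl_gaussian_fit(-2.0, 2.0)
print(f"forward-KL Gaussian: mean={mean}, variance={var}")
```

The fitted variance grows with the squared gap between the component means, which is why the forward-KL Gaussian spreads across both modes. The reverse-KL fit has no comparable closed form and is typically found by numerical optimization.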

Theorems & Definitions (60)

Definition 2.1: $f$-divergence
Definition 2.2: Cressie--Read power divergence
Theorem 2.3: CR limits to KL divergences
Definition 3.0: SRFE and associated CR
Theorem 3.1: KL limits
Lemma 3.1: Nonnegativity
Lemma 3.1: Monotone equivalence with CR
Theorem 3.2: SRFE is not an $f$-divergence
Theorem 3.3: Surprisal-variance expansion for standard CR
Theorem 3.4: SRFE local expansion around forward KL
...and 50 more
