Fenchel-Young Variational Learning

Sophia Sklaviadis, Thomas Moellenhoff, Andre Martins, Mario Figueiredo

TL;DR

This paper introduces a new general class of variational methods based on Fenchel-Young (FY) losses, treated as divergences that generalize (and encompass) the familiar Kullback-Leibler divergence at the core of classical variational learning.

Abstract

From a variational perspective, many statistical learning criteria involve seeking a distribution that balances empirical risk and regularization. In this paper, we broaden this perspective by introducing a new general class of variational methods based on Fenchel-Young (FY) losses, treated as divergences that generalize (and encompass) the familiar Kullback-Leibler divergence at the core of classical variational learning. Our proposed formulation -- FY variational learning -- includes as key ingredients new notions of FY free energy, FY evidence, FY evidence lower bound, and FY posterior. We derive alternating minimization and gradient backpropagation algorithms to compute (or lower bound) the FY evidence, which enables learning a wider class of models than previous variational formulations. This leads to generalized FY variants of classical algorithms, such as an FY expectation-maximization (FYEM) algorithm, and latent-variable models, such as an FY variational autoencoder (FYVAE). Our new methods are shown to be empirically competitive, often outperforming their classical counterparts, and most importantly, to have qualitatively novel features. For example, FYEM has an adaptively sparse E-step, while the FYVAE can support models with sparse observations and sparse posteriors.
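
As a worked illustration of how an FY loss can encompass the Kullback-Leibler divergence, the following sketch uses the standard finite-dimensional definition of a Fenchel-Young loss. The finite simplex $\triangle$, the score vector $\theta$, and the symbol $L_\Omega$ are assumptions made here for illustration; the paper itself works with distributions over a general space $\mathcal{Z}$ and its notation may differ.

```latex
% Fenchel-Young loss generated by a convex regularizer \Omega,
% with convex conjugate \Omega^*(\theta) = \sup_{p \in \triangle} \langle \theta, p \rangle - \Omega(p):
L_\Omega(\theta; p) = \Omega(p) + \Omega^*(\theta) - \langle \theta, p \rangle \;\ge\; 0 .

% Special case: negative Shannon entropy on the simplex,
% \Omega(p) = \sum_i p_i \log p_i , \qquad \Omega^*(\theta) = \log \sum_j e^{\theta_j} .
% Substituting into the definition:
L_\Omega(\theta; p)
  = \sum_i p_i \log p_i + \log \sum_j e^{\theta_j} - \sum_i p_i \theta_i
  = \sum_i p_i \log \frac{p_i}{\operatorname{softmax}(\theta)_i}
  = \mathrm{KL}\bigl(p \,\|\, \operatorname{softmax}(\theta)\bigr) .
```

Other choices of $\Omega$ (for example, Tsallis-type entropies) give divergences whose minimizers can have exact zeros, which is consistent with the sparse E-step and sparse posteriors mentioned in the abstract.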

Paper Structure

This paper contains 30 sections, 1 theorem, 34 equations, 4 figures, 3 tables.

Key Result

Proposition 1

Let $\mathcal{Q}$ be a set of probability distributions over a space $\mathcal{Z}$, let $\ell(x; z)$ be a measurable loss function, and let $\eta: \mathcal{Z} \rightarrow \mathbb{R}$ be a scoring function (possibly a log-prior). Let $\Omega: \mathcal{Q} \to \mathbb{R} \cup \{+\infty\}$ be a proper, lowe [...] Then the variational problem is equivalent to a generalized Tsallis maximum entropy problem under [...]

Figures (4)

  • Figure 1: 1D logistic regression with a $\mathcal{N}(0, 1)$ prior with $(2-\rho)$-Gaussian posteriors. Sparse posteriors (larger $\rho$) are closer to the MAP solution, as their truncation avoids sampling any points with high loss. In contrast, heavy-tailed posteriors move farther away from the high loss regions.
  • Figure 2: We qualitatively compare the standard, sparse, and hard versions of EM for GMM-based clustering. Data points are colored by their true labels, with x markers indicating noise points. Colored ellipses denote the data-generating covariances; they are the same across the three panels. Black ellipses are level curves of the estimated GMM components, and they illustrate the effect of E-step sparsity on the fit. Our sparse EM algorithm combines the best of both worlds: it is robust to outliers while still retaining soft assignments, as in the original EM.
  • Figure 3: We vary $\rho \in \{ 0.1, 0.5, 0.9, 1.0, 1.1, 1.5, 2.0, 3.0\}$ and plot, from left to right, the resulting adjusted mutual information, adjusted Rand index, and silhouette score. In the right-most panel we quantify the E-step sparsity by counting the number of clusters assigned zero mixing proportion at the last iteration of each EM run and averaging across samples. Since the total number of clusters is four, at most three components can be assigned zero proportion. The shaded regions represent standard errors of the mean across 5 seeds.
  • Figure 4: Topic coherence scores for the standard NVDM and the FYVAE models.
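
The contrast between a dense and an adaptively sparse E-step, as described in the Figure 2 and Figure 3 captions, can be sketched in a few lines. This is a minimal illustration and not the paper's FYEM algorithm: it assumes sparsemax (the Euclidean projection onto the simplex, which corresponds to a quadratic, Tsallis-2 style regularizer) as the sparse mapping, and how that choice relates to the paper's $\rho$ parameter is not stated in this summary. The function names and the example scores are hypothetical.

```python
import numpy as np

def softmax_responsibilities(scores):
    """Dense E-step: every component receives strictly positive weight."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)

def sparsemax_responsibilities(scores):
    """Sparse E-step (assumed variant): project each row of scores onto the simplex."""
    sorted_desc = -np.sort(-scores, axis=-1)           # scores in decreasing order
    cumsum = np.cumsum(sorted_desc, axis=-1)
    k = np.arange(1, scores.shape[-1] + 1)
    in_support = 1 + k * sorted_desc > cumsum          # true for a prefix of each row
    k_max = in_support.sum(axis=-1, keepdims=True)     # support size per row
    tau = (np.take_along_axis(cumsum, k_max - 1, axis=-1) - 1) / k_max
    return np.maximum(scores - tau, 0.0)               # threshold and clip at zero

# Hypothetical per-point scores, e.g. log mixing weight plus component
# log-likelihood for a 4-component GMM (two data points).
scores = np.array([[2.0, 1.0, -1.0, -3.0],
                   [0.5, 0.3, -2.0, -4.0]])

dense = softmax_responsibilities(scores)
sparse = sparsemax_responsibilities(scores)

# Number of components with exactly zero responsibility for each point.
zeros_per_point = (sparse == 0).sum(axis=-1)
```

In this sketch the first point is assigned entirely to one component and the second splits its mass between two, so assignments stay soft where the scores are close while unlikely components drop out exactly.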

Theorems & Definitions (2)

  • Proposition 1: Fenchel-Young VI and Max Tsallis Entropy Equivalence
  • proof