$t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence

Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, Joong-Ho Won

TL;DR

The paper tackles over-regularization in VAEs by introducing a heavy-tailed framework, the $t^3$VAE, which uses Student's $t$-distributions for the prior, encoder, and decoder to form a power-form joint model. It replaces the traditional ELBO-KL objective with a $\gamma$-loss based on $\gamma$-power divergence, with a single hyperparameter $\nu$ that governs regularization strength and tail behavior via the coupling $\gamma = -\frac{2}{\nu+n+m}$. The approach is extended to a Bayesian hierarchy as $t^3$HVAE, enabling deeper heavy-tailed representations. Empirically, $t^3$VAE excels at learning and generating data from heavy-tailed and low-density regions, achieving superior tail coverage on synthetic data and leading performance on CelebA and imbalanced CIFAR-100, with training costs comparable to standard Gaussian VAEs. This framework offers a principled, information-geometric route to robust, tail-aware generative modeling with practical implications for real-world, long-tailed data distributions.
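As a rough illustration of the coupling quoted above, the sketch below evaluates $\gamma = -\frac{2}{\nu+n+m}$ for a few values of $\nu$. The function name `gamma_from_nu` is hypothetical, and reading $n$ as the data dimension and $m$ as the latent dimension is an assumption, since the summary does not define them.

```python
def gamma_from_nu(nu: float, n: int, m: int) -> float:
    """Evaluate the coupling gamma = -2 / (nu + n + m).

    Assumption: n is the data dimension and m is the latent dimension;
    the summary text does not define them explicitly.
    """
    if nu <= 0:
        raise ValueError("degrees of freedom nu must be positive")
    return -2.0 / (nu + n + m)


# With n = m = 1, gamma is negative and moves toward 0 as nu grows.
for nu in (3, 10, 100, 10_000):
    print(nu, gamma_from_nu(nu, n=1, m=1))
```

Under this reading, a small $\nu$ gives a $\gamma$ further from zero and a large $\nu$ pushes $\gamma$ toward zero, which is consistent with $\nu$ acting as the single knob for tail behavior.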

Abstract

The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's $t$-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of KL divergence between two statistical manifolds and replacing it with $\gamma$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.
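To make the claim about fast-decaying Gaussian tails concrete, here is a minimal sketch comparing the probability mass beyond a threshold under a standard normal and under a univariate Student's $t$. It is not the paper's code: the helper names, the choice $\nu = 3$, the threshold 4, and the sample count are all illustrative assumptions.

```python
import math
import random


def sample_student_t(nu: float, rng: random.Random) -> float:
    """Draw one Student's t sample as Z / sqrt(V / nu), with V ~ chi-square(nu)."""
    z = rng.gauss(0.0, 1.0)
    # chi-square(nu) is a Gamma distribution with shape nu/2 and scale 2
    v = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(v / nu)


def empirical_tail(nu: float, threshold: float, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(|T| > threshold) for Student's t with nu degrees of freedom."""
    rng = random.Random(seed)
    hits = sum(abs(sample_student_t(nu, rng)) > threshold for _ in range(n_samples))
    return hits / n_samples


def gaussian_tail(threshold: float) -> float:
    """Exact P(|Z| > threshold) for a standard normal."""
    return math.erfc(threshold / math.sqrt(2.0))


# Illustrative values only: nu = 3 and threshold = 4 are not taken from the paper.
t_tail = empirical_tail(nu=3.0, threshold=4.0, n_samples=100_000)
print("Student's t tail mass:", t_tail)
print("Gaussian tail mass:   ", gaussian_tail(4.0))
```

The Student's $t$ places far more mass beyond the threshold than the Gaussian does, which is the property the abstract relies on when it argues that a heavy-tailed prior can accommodate encoded points that a Gaussian prior would penalize heavily.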

Paper Structure (36 sections, 5 theorems, 83 equations, 7 figures, 8 tables, 1 algorithm)

Key Result

Proposition 1

$\gamma$-power divergence as defined in Equations cgammadef-dgammadef for $\gamma\in (-1,0)\cup (0,\infty)$ is a divergence on the finite $\gamma$-entropy submanifold $\{p\in\mathcal{M}: \left\lVert p\right\rVert_{1+\gamma}<\infty\}$ of $\mathcal{M}$.

Figures (7)

  • Figure 1: (a) Dependency of regularization on $\Sigma_\phi(x)$ when $m=n=1$, $\sigma=1$ (left); (b) graph of the alternative prior scale $\tau$ against $\nu$ (middle), (c) graph of the regularizer coefficient $\alpha$ against $\nu$ (right).
  • Figure 2: Log-histograms of samples generated from $t^3$VAE ($\nu = 9,12,15,18,21$), Gaussian VAE, $\beta$-VAE ($\beta=0.1$), Student-$t$ VAE, DE-VAE and VAE-st. Solid lines illustrate the true density $p_\text{heavy}$.
  • Figure 3: Original and reconstructed images by $t^3$VAE ($\nu = 10$), Gaussian VAE, VAE with $\kappa = 1.5$, and Tilted VAE ($\tau = 50$).
  • Figure 3: Generation FID scores for CelebA.
  • Figure 4: Generated CelebA example images.
  • Figure 4: Original and reconstructed images by $t^3$HVAE ($\nu = 10$) and HVAE.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Proposition 5
  • proof