$t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence

Juno Kim; Jaehyuk Kwon; Mincheol Cho; Hyunjong Lee; Joong-Ho Won

TL;DR

The paper tackles over-regularization in VAEs by introducing a heavy-tailed framework, the $t^3$VAE, which uses Student's $t$-distributions for the prior, encoder, and decoder to form a power-form joint model. It replaces the traditional ELBO-KL objective with a $\gamma$-loss based on $\gamma$-power divergence, with a single hyperparameter $\nu$ that governs regularization strength and tail behavior via the coupling $\gamma = -\frac{2}{\nu+n+m}$. The approach is extended to a Bayesian hierarchy as $t^3$HVAE, enabling deeper heavy-tailed representations. Empirically, $t^3$VAE excels at learning and generating data from heavy-tailed and low-density regions, achieving superior tail coverage on synthetic data and leading performance on CelebA and imbalanced CIFAR-100, with training costs comparable to standard Gaussian VAEs. This framework offers a principled, information-geometric route to robust, tail-aware generative modeling with practical implications for real-world, long-tailed data distributions.
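
As a small illustration of the coupling between $\gamma$ and $\nu$ quoted above, the sketch below evaluates $\gamma = -2/(\nu+n+m)$ for a few values of $\nu$. The function name `gamma_from_nu` is hypothetical, and reading $n$ as the data dimension and $m$ as the latent dimension is an assumption, since this summary does not define them.

```python
def gamma_from_nu(nu: float, n: int, m: int) -> float:
    """Return gamma = -2 / (nu + n + m), the coupling quoted in the TL;DR.

    Assumption: n is the data dimension and m is the latent dimension.
    """
    if nu <= 0 or n < 1 or m < 1:
        raise ValueError("expected nu > 0 and n, m >= 1")
    return -2.0 / (nu + n + m)


if __name__ == "__main__":
    # gamma approaches 0 from below as nu grows.
    for nu in (3, 10, 100, 10_000):
        print(nu, gamma_from_nu(nu, n=1, m=1))
```

For any $\nu > 0$ with $n, m \ge 1$ the denominator exceeds 2, so the resulting $\gamma$ lies in $(-1, 0)$, which is within the range covered by Proposition 1.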

Abstract

The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's $t$-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of KL divergence between two statistical manifolds and replacing it with $\gamma$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.
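
To illustrate why a Gaussian tail can decay too quickly, the following sketch compares how often a univariate Student's $t$ sample and a standard normal sample land far from the origin. It is not the paper's implementation: it is a one-dimensional toy using the standard scale-mixture construction of the $t$-distribution, and the names `sample_student_t` and `tail_fraction`, the choice $\nu = 5$, and the threshold of 4 are all illustrative assumptions.

```python
import math
import random


def sample_student_t(nu: float, rng: random.Random) -> float:
    """Draw one standard Student's t variate with nu degrees of freedom.

    Uses the scale-mixture identity t = z / sqrt(u / nu), where z ~ N(0, 1)
    and u ~ chi-square(nu), i.e. Gamma(shape=nu/2, scale=2).
    """
    z = rng.gauss(0.0, 1.0)
    u = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(u / nu)


def tail_fraction(samples, threshold: float) -> float:
    """Fraction of samples whose absolute value exceeds the threshold."""
    return sum(abs(s) > threshold for s in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n_draws = 100_000
    t_draws = [sample_student_t(5.0, rng) for _ in range(n_draws)]
    g_draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
    print("t (nu=5) beyond 4:", tail_fraction(t_draws, 4.0))
    print("Gaussian beyond 4:", tail_fraction(g_draws, 4.0))
```

The $t$ sample places noticeably more mass beyond the threshold than the Gaussian sample, which is the kind of low-density region the abstract says a standard normal prior struggles to accommodate.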

Paper Structure (36 sections, 5 theorems, 83 equations, 7 figures, 8 tables, 1 algorithm)

Introduction
Related Works
Theoretical Background
VAE as Joint Minimization
Information Geometry and $\gamma$-power Divergence
The $t^3$-Variational Autoencoder
Structure of the $t^3$VAE
$\gamma$-power Divergence Loss
$\nu$ Controls Regularization Strength
$t^3$HVAE: the Bayesian Hierarchy
Experiments
Learning Heavy-tailed Bimodal Distributions
Univariate dataset.
Bivariate dataset.
Learning High-dimensional Images
...and 21 more sections

Key Result

Proposition 1

$\gamma$-power divergence as defined in Equations cgammadef-dgammadef for $\gamma\in (-1,0)\cup (0,\infty)$ is a divergence on the finite $\gamma$-entropy submanifold $\{p\in\mathcal{M}: \left\lVert p\right\rVert_{1+\gamma}<\infty\}$ of $\mathcal{M}$.
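
To make the finiteness condition in Proposition 1 concrete, the worked check below considers a univariate Student's $t$ density $p_\nu$. It assumes that $\lVert p\rVert_{1+\gamma}$ denotes the usual $L^{1+\gamma}$ norm, $\left(\int p^{1+\gamma}\right)^{1/(1+\gamma)}$, which this summary does not spell out.

```latex
p_\nu(x) \propto \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}
\quad\Longrightarrow\quad
p_\nu(x)^{1+\gamma} \sim |x|^{-(\nu+1)(1+\gamma)} \quad (|x| \to \infty),
\\
\int p_\nu(x)^{1+\gamma}\,dx < \infty
\iff (\nu+1)(1+\gamma) > 1
\iff \gamma > -\frac{\nu}{\nu+1}.
```

Under this reading, a Cauchy density ($\nu = 1$) belongs to the submanifold only for $\gamma > -\tfrac{1}{2}$, whereas a Gaussian density has finite norm for every $\gamma > -1$.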

Figures (7)

Figure 1: (a) Dependency of regularization on $\Sigma_\phi(x)$ when $m=n=1$, $\sigma=1$ (left); (b) graph of the alternative prior scale $\tau$ against $\nu$ (middle), (c) graph of the regularizer coefficient $\alpha$ against $\nu$ (right).
Figure 2: Log-histograms of samples generated from $t^3$VAE ($\nu = 9,12,15,18,21$), Gaussian VAE, $\beta$-VAE ($\beta=0.1$), Student-$t$ VAE, DE-VAE and VAE-st. Solid lines illustrate the true density $p_\text{heavy}$.
Figure 3: Original and reconstructed images by $t^3$VAE ($\nu = 10$), Gaussian VAE, VAE with $\kappa = 1.5$, and Tilted VAE ($\tau = 50$).
Figure 3: Generation FID scores for CelebA. Figure 4: Generated CelebA example images.
Figure 4: Original and reconstructed images by $t^3$HVAE ($\nu = 10$) and HVAE.
...and 2 more figures

Theorems & Definitions (10)

Proposition 1
proof
Proposition 2
proof
Proposition 3
proof
Proposition 4
proof
Proposition 5
proof
