$t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence
Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, Joong-Ho Won
TL;DR
The paper tackles over-regularization in VAEs by introducing a heavy-tailed framework, $t^3$VAE, which uses Student's $t$-distributions for the prior, encoder, and decoder to form a power-form joint model. It replaces the traditional ELBO-KL objective with a $\gamma$-loss based on the $\gamma$-power divergence, with a single hyperparameter $\nu$ that governs regularization strength and tail behavior via the coupling $\gamma = -\frac{2}{\nu+n+m}$. The approach is extended to a Bayesian hierarchy as $t^3$HVAE, enabling deeper heavy-tailed representations. Empirically, $t^3$VAE excels at learning and generating data from heavy-tailed and low-density regions, achieving superior tail coverage on synthetic data and leading performance on CelebA and imbalanced CIFAR-100, with training costs comparable to standard Gaussian VAEs. This framework offers a principled, information-geometric route to robust, tail-aware generative modeling with practical implications for real-world, long-tailed data distributions.
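To make the coupling $\gamma = -\frac{2}{\nu+n+m}$ concrete, the following is a minimal Python sketch that evaluates it for a few values of $\nu$. It assumes that $n$ and $m$ denote the data and latent dimensions, which this summary does not state explicitly. The function name `gamma_from_nu` and the sizes `n = 784`, `m = 32` are hypothetical, chosen only for illustration.

```python
def gamma_from_nu(nu: float, n: int, m: int) -> float:
    """Evaluate the coupling gamma = -2 / (nu + n + m).

    Assumption: n is the data dimension and m the latent dimension.
    """
    if nu <= 0:
        # Degrees of freedom of a Student's t-distribution must be positive.
        raise ValueError("nu must be positive")
    return -2.0 / (nu + n + m)


# Hypothetical sizes, for illustration only.
n, m = 784, 32
for nu in (2.5, 10.0, 1e6):
    print(f"nu={nu:g}  gamma={gamma_from_nu(nu, n, m):.3e}")
```

Under this formula $\gamma$ is always negative and approaches 0 from below as $\nu$ grows, so a single choice of $\nu$ fixes both the tail heaviness and the value of $\gamma$ used in the objective.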
Abstract
The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's $t$-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of the KL divergence between two statistical manifolds and replacing it with the $\gamma$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.
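The claim that the Gaussian tail decays too quickly can be illustrated numerically. The sketch below compares the univariate standard normal density with a univariate Student's $t$ density at increasing distances from the origin, using only the Python standard library. The univariate setting and the choice $\nu = 3$ are assumptions made for illustration; they are not the configuration used in the paper.

```python
import math


def normal_pdf(x: float) -> float:
    """Density of the standard normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def student_t_pdf(x: float, nu: float) -> float:
    """Density of a univariate Student's t-distribution with nu degrees of freedom."""
    # Log of the normalizing constant Gamma((nu+1)/2) / (sqrt(nu*pi) * Gamma(nu/2)).
    log_norm = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
    )
    # The polynomial kernel (1 + x^2/nu)^(-(nu+1)/2), evaluated in log space.
    return math.exp(log_norm - (nu + 1.0) / 2.0 * math.log1p(x * x / nu))


# nu = 3 is an illustrative choice, not a value taken from the paper.
for x in (0.0, 2.0, 4.0, 8.0):
    print(f"x={x:4.1f}  normal={normal_pdf(x):.3e}  t(nu=3)={student_t_pdf(x, 3.0):.3e}")
```

The normal density falls off exponentially in $x^2$, while the $t$ density falls off only polynomially, so far from the origin the $t$ density is larger by many orders of magnitude. This is the sense in which a heavy-tailed prior leaves more room for encoded points in low-density regions.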
