Vector Quantization using Gaussian Variational Autoencoder

Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang

TL;DR

This paper tackles the challenge of training discrete VQ-VAE representations by introducing Gaussian Quant (GQ), a training-free method that converts a Gaussian VAE into a VQ-VAE using a fixed Gaussian noise codebook and nearest-neighbor quantization of posterior means. The authors establish a theoretical link between codebook size and bits-back coding rate, and devise Target Divergence Constraint (TDC) to train Gaussian VAEs so per-dimension KL divergences align with the codebook bitrate. Empirically, GQ with TDC outperforms existing VQ-VAE baselines (e.g., VQGAN, FSQ, LFQ, BSQ) on UNet and ViT backbones across 0.25–1.00 bpp, and TDC also improves TokenBridge. The work provides a principled, efficient, and training-free pathway to high-quality discrete representations with strong rate-distortion performance and practical implications for autoregressive generation and compression.

Abstract

Vector quantized variational autoencoder (VQ-VAE) is a discrete auto-encoder that compresses images into discrete tokens. It is difficult to train because of the discretization. In this paper, we propose a simple yet effective technique, dubbed Gaussian Quant (GQ), that converts a Gaussian VAE satisfying a certain constraint into a VQ-VAE without training. GQ generates random Gaussian noise as a codebook and, for each posterior mean, selects the closest noise sample. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic for training a Gaussian VAE for effective GQ, named target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves upon previous Gaussian VAE discretization methods, such as TokenBridge. The source code is available at https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
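The quantization step described in the abstract (a fixed codebook of Gaussian noise, followed by a nearest-neighbor search around the posterior mean) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `make_codebook` and `gaussian_quantize`, the seed handling, and the choice to quantize a whole `dim`-dimensional vector at once are all assumptions made for illustration. How the paper groups latent dimensions may differ.

```python
import numpy as np


def make_codebook(num_codes, dim, seed=0):
    """Draw a fixed codebook of i.i.d. standard Gaussian noise.

    Hypothetical helper: encoder and decoder would share the seed so that
    both sides regenerate the same codebook without any training.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_codes, dim))


def gaussian_quantize(mu, codebook):
    """Map each posterior mean to its nearest codeword (Euclidean distance).

    mu:       array of shape (n, dim), posterior means from a Gaussian VAE
    codebook: array of shape (K, dim)
    Returns the token indices (n,) and the quantized latents (n, dim).
    """
    # Squared distances between every mean and every codeword: shape (n, K).
    diff = mu[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    idx = dist2.argmin(axis=1)
    return idx, codebook[idx]


if __name__ == "__main__":
    codebook = make_codebook(num_codes=1024, dim=2, seed=0)
    # Stand-in posterior means; a real encoder would produce these.
    mu = np.array([[0.3, -1.2], [1.5, 0.1]])
    idx, z_hat = gaussian_quantize(mu, codebook)
    print("token indices:", idx)
    print("bits per token:", np.log2(len(codebook)))
```

Each token costs `log2(K)` bits for a codebook of size `K`, which is the quantity the abstract compares against the bits-back coding rate of the Gaussian VAE.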

Paper Structure

This paper contains 40 sections, 2 theorems, 38 equations, 5 figures, 22 tables, 2 algorithms.

Key Result

Theorem 1

Denote the mean and standard deviation of $q(Z_i|X=x)$ as $\mu_i$ and $\sigma_i$, respectively. Assuming that the product and sum satisfy $|\mu_i\sigma_i| \leq c_1$ and $|\mu_i| + |\sigma_i| \leq c_2$, the probability of a quantization error $|\hat{z}_i - \mu_i| \geq \sigma_i$ decays doubly exponent
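The error event named in Theorem 1, $|\hat{z}_i - \mu_i| \geq \sigma_i$, can be explored numerically in a toy scalar setting. The sketch below is not the paper's proof or experimental setup. It assumes a one-dimensional codebook of `k` i.i.d. standard Gaussian samples and a nearest-neighbor quantizer, and all function names are hypothetical. Under that assumption the event occurs exactly when no codeword lands within $\sigma_i$ of $\mu_i$, so its probability is $(1 - p)^k$ with $p = \Phi(\mu_i + \sigma_i) - \Phi(\mu_i - \sigma_i)$. The code compares this closed form with a Monte Carlo estimate and reports the per-dimension KL divergence in bits for reference.

```python
import math

import numpy as np


def kl_bits(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), converted from nats to bits."""
    nats = 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))
    return nats / math.log(2.0)


def exact_failure_prob(mu, sigma, k):
    """P(no codeword within sigma of mu) for k i.i.d. N(0, 1) codewords."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

    p_hit = cdf(mu + sigma) - cdf(mu - sigma)
    return (1.0 - p_hit) ** k


def simulated_failure_rate(mu, sigma, k, trials=2000, seed=0):
    """Monte Carlo estimate of P(|z_hat - mu| >= sigma) over random codebooks."""
    rng = np.random.default_rng(seed)
    codebooks = rng.standard_normal((trials, k))
    # Distance from mu to its nearest codeword in each sampled codebook.
    nearest = np.abs(codebooks - mu).min(axis=1)
    return float(np.mean(nearest >= sigma))


if __name__ == "__main__":
    mu, sigma = 1.0, 0.1  # illustrative values, not taken from the paper
    print(f"KL in bits: {kl_bits(mu, sigma):.2f}")
    for k in (4, 16, 64, 256):
        exact = exact_failure_prob(mu, sigma, k)
        sim = simulated_failure_rate(mu, sigma, k)
        print(f"log2(k)={math.log2(k):.0f}  exact={exact:.2e}  simulated={sim:.2e}")
```

In this toy case, because the failure probability is $(1 - p)^k$ and $k = 2^{\text{bits}}$, it falls off very quickly once the number of codebook bits passes the KL divergence in bits.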

Figures (5)

  • Figure 1: The rate-distortion performance on the ImageNet dataset demonstrates that GQ outperforms previous VQ-VAEs on both UNet and ViT architectures.
  • Figure 2: Qualitative results on the ImageNet dataset at 0.25 bpp. GQ gives the most visually pleasing reconstructions.
  • Figure 3: A visualization of the lower and upper bounds on large quantization error, computed on the ImageNet validation dataset.
  • Figure 4: The t-SNE visualization of the latents of GQ vs. the unquantized Gaussian VAE. The latents before and after quantization are quite similar.
  • Figure 5: Qualitative results on the ImageNet dataset at 0.25 bpp. None of the approaches correctly reconstructs the plate number.

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof
  • proof