VP-VAE: Rethinking Vector Quantization via Adaptive Vector Perturbation

Linwei Zhai, Han Ding, Mingzhi Lin, Cui Zhao, Fei Wang, Ge Wang, Wang Zhi, Wei Xi

TL;DR

We propose VP-VAE (Vector Perturbation VAE), a paradigm that decouples representation learning from discretization by eliminating the need for an explicit codebook during training. We also derive FSP (Finite Scalar Perturbation), a lightweight variant of VP-VAE that provides both a unified theoretical explanation of, and a practical improvement for, FSQ-style fixed quantizers.

Abstract

Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental to modern generative modeling, yet they often suffer from training instability and "codebook collapse" due to the inherent coupling of representation learning and discrete codebook optimization. In this paper, we propose VP-VAE (Vector Perturbation VAE), a novel paradigm that decouples representation learning from discretization by eliminating the need for an explicit codebook during training. Our key insight is that, from the neural network's viewpoint, performing quantization primarily manifests as injecting a structured perturbation in latent space. Accordingly, VP-VAE replaces the non-differentiable quantizer with distribution-consistent and scale-adaptive latent perturbations generated via Metropolis-Hastings sampling. This design enables stable training without a codebook while making the model robust to inference-time quantization error. Moreover, under the assumption of approximately uniform latent variables, we derive FSP (Finite Scalar Perturbation), a lightweight variant of VP-VAE that provides a unified theoretical explanation and a practical improvement for FSQ-style fixed quantizers. Extensive experiments on image and audio benchmarks demonstrate that VP-VAE and FSP improve reconstruction fidelity and achieve substantially more balanced token usage, while avoiding the instability inherent to coupled codebook training.
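
The two ideas in the abstract, quantization viewed as a perturbation and perturbations drawn by Metropolis-Hastings sampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the standard-normal `log_density` is an assumed stand-in for the latent distribution, and the random-walk proposal with a fixed `scale` is a generic textbook choice, since the paper's actual target density and scale adaptation are not given in this summary.

```python
import math
import random

def nearest_codeword(z, codebook):
    """Return the codebook vector closest to z in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((zi - ci) ** 2 for zi, ci in zip(z, c)))

def quantization_residual(z, codebook):
    """Perturbation e such that quantize(z) == z + e."""
    q = nearest_codeword(z, codebook)
    return [qi - zi for zi, qi in zip(z, q)]

def log_density(z):
    # ASSUMPTION: standard normal stand-in for the latent distribution.
    return -0.5 * sum(zi * zi for zi in z)

def mh_perturb(z, scale, rng):
    """One random-walk Metropolis-Hastings step starting at z.

    Returns the proposal if accepted, otherwise an unchanged copy of z.
    """
    proposal = [zi + scale * rng.gauss(0.0, 1.0) for zi in z]
    log_accept = log_density(proposal) - log_density(z)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal
    return list(z)

# Usage: the quantizer moves z by a residual; MH moves z by a sampled step.
codebook = [[0.0, 0.0], [1.0, 1.0]]   # hypothetical two-entry codebook
z = [0.8, 0.7]
residual = quantization_residual(z, codebook)
perturbed = mh_perturb(z, scale=0.1, rng=random.Random(0))
```

From the decoder's side both operations look the same: it receives `z` plus some displacement. The accept/reject step is what keeps the perturbed latent consistent with the assumed density, because moves toward low-density regions are rejected more often.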

Paper Structure (40 sections, 19 equations, 2 figures, 5 tables, 1 algorithm)

This paper contains 40 sections, 19 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Codebook utilization during training. CVU curves for different methods on image reconstruction ($K{=}1024$). VQ-VAE and FSQ exhibit an initial rise followed by a decline. VP-VAE and FSP maintain stable, high utilization throughout training.
  • Figure 2: Output distributions of fixed quantization schemes. Given a uniform latent distribution, we compare the quantized output distributions produced by FSQ, FSQ with noise, symmetric FSQ with noise, and FSP. They are all configured with $L{=}4$ quantization levels. FSP produces a more uniform output distribution, aligning with the Lloyd-Max optimality principle.
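
As a rough illustration of why level usage can be uneven under a uniform latent, the sketch below compares two generic scalar quantizers with $L{=}4$ levels. It is not the FSQ or FSP implementation from the paper. `round_to_grid` is a hypothetical stand-in for rounding onto a grid whose outermost levels sit at the ends of the input range, and `equal_width_cells` is the Lloyd-Max solution for a uniform input (equal-width cells, each represented by its midpoint). The evenly spaced inputs are an assumed deterministic substitute for uniform samples.

```python
from collections import Counter

def round_to_grid(u, levels):
    """Round u in [0, 1] to the nearest of `levels` points that include 0 and 1."""
    return round(u * (levels - 1))

def equal_width_cells(u, levels):
    """Assign u in [0, 1] to one of `levels` equal-width cells."""
    return min(int(u * levels), levels - 1)

def usage(assign, levels, n=12000):
    """Fraction of evenly spaced inputs in (0, 1) mapped to each level."""
    counts = Counter(assign((i + 0.5) / n, levels) for i in range(n))
    return [counts[k] / n for k in range(levels)]

grid_usage = usage(round_to_grid, 4)      # edge levels receive half-width cells
cell_usage = usage(equal_width_cells, 4)  # every level receives the same mass
```

With endpoint-inclusive rounding, the two edge levels each capture one sixth of the inputs and the two interior levels one third each, whereas the equal-width scheme assigns one quarter to every level.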