Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, Linli Xu
TL;DR
This work identifies representation collapse in vector quantized models as stemming from disjoint codebook optimization, where only a subset of code vectors is updated during training. It proposes SimVQ, a simple method that reparameterizes the codebook through a learnable latent linear transformation, allowing joint optimization of the latent space and preventing collapse without reducing model capacity. Across image and audio experiments, SimVQ achieves near-complete codebook utilization, superior reconstruction metrics, and robust scalability to very large codebooks, while offering favorable memory efficiency. The results suggest that updating the latent space rather than individual codes is a key principle for stabilizing VQ representations and enabling scalable discrete tokenization for multimodal learning.
Abstract
Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, which compromises model capacity and fails to fully solve the problem. We identify the root cause as disjoint codebook optimization, where only a few code vectors are updated via gradient descent. To fix this, we propose SimpleVQ (SimVQ), which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than only the nearest individual code vectors. Although the product of two matrices is mathematically equivalent to a single linear layer, this simple approach effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures. The code is available at https://github.com/youngsheen/SimVQ.
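
To make the contrast between disjoint and joint optimization concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes a frozen latent basis `C`, a learnable matrix `W` initialised to the identity, and a plain squared-distance loss between one encoder output `z` and its nearest code; the names `C`, `W`, `quantize`, and all sizes are illustrative choices. It compares which code vectors move after one gradient step in each parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 8, 4                       # codebook size and latent dimension (illustrative)
lr = 0.1                          # step size (illustrative)

C = rng.normal(size=(K, d))       # latent basis, assumed frozen in this sketch
W = np.eye(d)                     # learnable linear layer, identity init (an assumption)
z = rng.normal(size=(1, d))       # a single encoder output


def quantize(z, codebook):
    """Return the index of the nearest code (squared L2) for each row of z."""
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)


# Vanilla VQ: the loss ||codebook[k] - z||^2 has a gradient only for the selected row k.
codebook = C.copy()
k = quantize(z, codebook)[0]
vanilla_new = codebook.copy()
vanilla_new[k] -= lr * 2.0 * (codebook[k] - z[0])
moved_vanilla = np.any(vanilla_new != codebook, axis=1)

# Reparameterized codebook: effective codes are C @ W, and only W is updated.
effective = C @ W
k_sim = quantize(z, effective)[0]
# Gradient of ||C[k] @ W - z||^2 with respect to W is 2 * outer(C[k], C[k] @ W - z).
grad_W = 2.0 * np.outer(C[k_sim], effective[k_sim] - z[0])
W_new = W - lr * grad_W
moved_simvq = np.any((C @ W_new) != effective, axis=1)

print("codes moved, vanilla VQ:      ", int(moved_vanilla.sum()), "of", K)
print("codes moved, reparameterized: ", int(moved_simvq.sum()), "of", K)
```

In this toy setting the vanilla update touches only the selected code, whereas the update to `W` shifts every row of `C @ W` whose basis vector is not orthogonal to the selected one, which is the sense in which the whole linear space is optimized rather than individual codes.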
