LASERS: LAtent Space Encoding for Representations with Sparsity for Generative Modeling
Xin Li, Anand Sarwate
TL;DR
This work challenges the premise that latent-space discretization is essential by introducing sparse dictionary learning as a latent-space bottleneck for VAE/GAN models. By representing each latent feature as a sparse combination of learned dictionary atoms, the DL-VAE and DL-GAN achieve more expressive latent spaces, mitigate codebook collapse, and improve reconstruction quality across multiple datasets. The authors provide a detailed algorithmic framework (Batch-OMP, online dictionary updates) and demonstrate strong performance in downstream tasks, including super-resolution, inpainting, and reshaping the latent space of Stable Diffusion. The approach offers a versatile alternative to vector quantization that leverages lossy compression in a structured, learnable latent space with practical benefits for real-world generative modeling.
Abstract
Learning compact and meaningful latent space representations has been shown to be very useful in generative modeling tasks for visual data. One particular example is applying Vector Quantization (VQ) in variational autoencoders (VQ-VAEs, VQ-GANs, etc.), which has demonstrated state-of-the-art performance in many modern generative modeling applications. Quantizing the latent space has been justified by the assumption that the data themselves are inherently discrete in the latent space (like pixel values). In this paper, we propose an alternative representation of the latent space by relaxing the structural assumption than the VQ formulation. Specifically, we assume that the latent space can be approximated by a union of subspaces model corresponding to a dictionary-based representation under a sparsity constraint. The dictionary is learned/updated during the training process. We apply this approach to look at two models: Dictionary Learning Variational Autoencoders (DL-VAEs) and DL-VAEs with Generative Adversarial Networks (DL-GANs). We show empirically that our more latent space is more expressive and has leads to better representations than the VQ approach in terms of reconstruction quality at the expense of a small computational overhead for the latent space computation. Our results thus suggest that the true benefit of the VQ approach might not be from discretization of the latent space, but rather the lossy compression of the latent space. We confirm this hypothesis by showing that our sparse representations also address the codebook collapse issue as found common in VQ-family models.
