LASERS: LAtent Space Encoding for Representations with Sparsity for Generative Modeling

Xin Li, Anand Sarwate

TL;DR

This work challenges the premise that latent-space discretization is essential by introducing sparse dictionary learning as a latent-space bottleneck for VAE/GAN models. By representing each latent feature as a sparse combination of learned dictionary atoms, the DL-VAE and DL-GAN achieve more expressive latent spaces, mitigate codebook collapse, and improve reconstruction quality across multiple datasets. The authors provide a detailed algorithmic framework (Batch-OMP, online dictionary updates) and demonstrate strong performance in downstream tasks, including super-resolution, inpainting, and reshaping the latent space of Stable Diffusion. The approach offers a versatile alternative to vector quantization that leverages lossy compression in a structured, learnable latent space with practical benefits for real-world generative modeling.

Abstract

Learning compact and meaningful latent space representations has been shown to be very useful in generative modeling tasks for visual data. One particular example is applying Vector Quantization (VQ) in variational autoencoders (VQ-VAEs, VQ-GANs, etc.), which has demonstrated state-of-the-art performance in many modern generative modeling applications. Quantizing the latent space has been justified by the assumption that the data themselves are inherently discrete in the latent space (like pixel values). In this paper, we propose an alternative representation of the latent space that relaxes the structural assumption of the VQ formulation. Specifically, we assume that the latent space can be approximated by a union-of-subspaces model corresponding to a dictionary-based representation under a sparsity constraint. The dictionary is learned and updated during the training process. We apply this approach to two models: Dictionary Learning Variational Autoencoders (DL-VAEs) and DL-VAEs with Generative Adversarial Networks (DL-GANs). We show empirically that our latent space is more expressive and leads to better representations than the VQ approach in terms of reconstruction quality, at the expense of a small computational overhead for the latent space computation. Our results thus suggest that the true benefit of the VQ approach might come not from discretization of the latent space, but rather from lossy compression of the latent space. We confirm this hypothesis by showing that our sparse representations also address the codebook collapse issue commonly found in VQ-family models.
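To make the contrast between the two bottlenecks concrete, the following is a minimal sketch, not the authors' implementation. It uses plain Orthogonal Matching Pursuit in dependency-free Python rather than the Batch-OMP variant named in the TL;DR, and it assumes unit-norm dictionary atoms. The helper names (`omp`, `vq`, `reconstruct`, `solve`) and the toy dictionary values are hypothetical, chosen only for illustration. It encodes one latent vector with at most `sparsity` atoms and compares the squared reconstruction error against picking the single nearest atom, as a VQ codebook lookup would.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_err(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def solve(A, b):
    """Solve a small dense system A x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def reconstruct(code, atoms, dim):
    """Rebuild a latent vector from a sparse code {atom_index: coefficient}."""
    out = [0.0] * dim
    for k, coeff in code.items():
        for i in range(dim):
            out[i] += coeff * atoms[k][i]
    return out

def omp(z, atoms, sparsity):
    """Plain OMP: greedy atom selection with a joint least-squares refit."""
    support, coeffs = [], []
    residual = list(z)
    for _ in range(sparsity):
        # Score each unused atom by its correlation with the residual.
        scores = [abs(dot(a, residual)) for a in atoms]
        for j in support:
            scores[j] = -1.0
        k = max(range(len(atoms)), key=scores.__getitem__)
        if scores[k] <= 1e-12:
            break  # residual is already (numerically) explained
        support.append(k)
        # Refit all selected coefficients via the normal equations.
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        rhs = [dot(atoms[i], z) for i in support]
        coeffs = solve(gram, rhs)
        approx = reconstruct(dict(zip(support, coeffs)), atoms, len(z))
        residual = [zi - ai for zi, ai in zip(z, approx)]
    return dict(zip(support, coeffs))

def vq(z, atoms):
    """VQ baseline: index of the single nearest atom (codebook entry)."""
    return min(range(len(atoms)), key=lambda k: sq_err(z, atoms[k]))

# Toy dictionary of K = 4 unit-norm atoms in R^3 (illustrative values only).
atoms = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.6, 0.8, 0.0)]
z = (2.0, 0.5, 1.0)

code = omp(z, atoms, sparsity=2)
sparse_err = sq_err(z, reconstruct(code, atoms, len(z)))
vq_index = vq(z, atoms)
vq_err = sq_err(z, atoms[vq_index])
print(code, sparse_err, vq_index, vq_err)
```

In this toy setting the sparse code can scale and combine atoms, so its reconstruction error cannot exceed that of the best single scaled atom, whereas the VQ lookup is restricted to the atoms themselves. This is only an illustration of the union-of-subspaces idea; the paper's reported gains come from trained models, not from this example.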

Paper Structure

This paper contains 22 sections, 27 equations, 42 figures, 8 tables, and 1 algorithm.

Figures (42)

  • Figure 1: Architecture of a generic autoencoder model with the compression bottleneck.
  • Figure 2: (a) The internal working mechanism of the Dictionary Learning Compression Bottleneck. Each fiber of the latent sparse codes has length $K$; however, because of the sparse structure assumption, the sparse codes can always be stored in a sparse data structure such as the Sparse COO or CSR format [cspytorch], which reduces each fiber to length $S \ll K$. (b) How the encoder outputs and the dictionary atoms move towards each other during training.
  • Figure 3: High-level architectural overview of DL-GAN.
  • Figure 4: (a) The training evolution of the VQ-VAE model; (b) the training evolution of the DL-VAE model. For both models we evaluate the codebook/dictionary perplexity and the reconstruction PSNR, both smoothed with the Savitzky–Golay filter [savgol].
  • Figure 5: The top row shows the top singular component of the VQ-VAE encoder output; the bottom row shows the top singular component of the early-stage latent space reconstruction produced by the Vector Quantization bottleneck.
  • ...and 37 more figures
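
The captions for Figures 2 and 4 refer to sparse storage of the codes and to perplexity and PSNR as training metrics. The sketch below illustrates these three quantities in dependency-free Python. It rests on assumptions: the function names (`to_coo`, `perplexity`, `psnr`) are hypothetical, perplexity is taken to be the exponential of the Shannon entropy of the empirical atom-usage distribution (a common convention for VQ models, which the text above does not spell out), and PSNR uses the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$.

```python
import math
from collections import Counter

def to_coo(dense_codes, tol=0.0):
    """Convert dense length-K code fibers to COO triplets (rows, cols, values)."""
    rows, cols, vals = [], [], []
    for r, fiber in enumerate(dense_codes):
        for c, v in enumerate(fiber):
            if abs(v) > tol:  # keep only the (at most S) non-zero entries
                rows.append(r)
                cols.append(c)
                vals.append(v)
    return rows, cols, vals

def perplexity(atom_indices):
    """exp(entropy) of how often each atom/codeword index is used (assumed definition)."""
    counts = Counter(atom_indices)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)

def psnr(x, x_hat, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a signal and its reconstruction."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

# Two fibers with K = 8 entries each, of which at most S = 2 are non-zero.
dense_codes = [
    [0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
rows, cols, vals = to_coo(dense_codes)

uniform_usage = perplexity([0, 1, 2, 3])    # every atom used equally often
collapsed_usage = perplexity([5, 5, 5, 5])  # a single atom used every time
quality_db = psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
print(rows, cols, vals, uniform_usage, collapsed_usage, quality_db)
```

Under this definition, perplexity equals the number of atoms when usage is uniform and drops to 1 when a single atom dominates, which is why it is a convenient indicator of codebook collapse.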