Disentanglement with Factor Quantized Variational Autoencoders

Gulcin Baykal, Melih Kandemir, Gozde Unal

TL;DR

The paper addresses unsupervised disentangled representation learning by proposing FactorQVAE, a discrete VAE that uses scalar quantization over a single global codebook and a total correlation regularizer to encourage independence among latent factors. It optimizes a differentiable, stochastic posterior via Gumbel-Softmax and incorporates a TC-based constraint into the ELBO, enabling stable training and improved disentanglement without ground-truth factor labels. Through extensive experiments on Shapes3D, Isaac3D, and MPI3D, FactorQVAE achieves superior DCI and InfoMEC scores while maintaining competitive reconstruction quality, with ablations demonstrating the benefits of scalar quantization and a global codebook over vector quantization and per-dimension codebooks. The work highlights the practical impact of combining discrete latent representations with factor-aware regularization and provides code for replication and further exploration.
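
As a reading aid, the total correlation (TC) regularizer mentioned above can be written out in its standard form. The expressions below follow the usual FactorVAE-style formulation and are an assumption about how the term enters the objective; the weight $\gamma$ and the shorthand $D(z)$ (the probability the discriminator assigns to $z$ being drawn from $q(z)$ rather than from $\bar{q}(z)$) are notation introduced here, not taken from the paper.

$$
\mathrm{TC}(z) \;=\; \mathrm{KL}\big(q(z)\,\|\,\bar{q}(z)\big), \qquad \bar{q}(z) \;=\; \prod_{j} q(z_j)
$$

$$
\mathrm{TC}(z) \;\approx\; \mathbb{E}_{q(z)}\!\left[\log \frac{D(z)}{1 - D(z)}\right], \qquad \mathcal{L} \;=\; \mathrm{ELBO} \;-\; \gamma\,\mathrm{TC}(z)
$$

The TC is zero exactly when $q(z)$ factorizes over the latent dimensions, which is why penalizing it pushes the latent variables toward statistical independence.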

Abstract

Disentangled representation learning aims to represent the underlying generative factors of a dataset in a latent representation independently of one another. In our work, we propose a discrete variational autoencoder (VAE) based model where the ground truth information about the generative factors is not provided to the model. We demonstrate the advantages of learning discrete representations over learning continuous representations in facilitating disentanglement. Furthermore, we propose incorporating an inductive bias into the model to further enhance disentanglement. Precisely, we propose scalar quantization of the latent variables in a latent representation with scalar values from a global codebook, and we add a total correlation term to the optimization as an inductive bias. Our method, called FactorQVAE, combines optimization-based disentanglement approaches with discrete representation learning, and it outperforms previous disentanglement methods in terms of two disentanglement metrics (DCI and InfoMEC) while improving the reconstruction performance. Our code can be found at https://github.com/ituvisionlab/FactorQVAE.
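
The scalar quantization step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `gumbel_softmax_scalar_quantize`, the use of negative squared distances as logits, and the temperature `tau` are assumptions made for illustration. The only parts taken from the text are that every latent variable is a scalar, that all latent dimensions share one global codebook, and that indices are sampled stochastically via Gumbel-Softmax.

```python
import numpy as np

def gumbel_softmax_scalar_quantize(z_e, codebook, tau=1.0, rng=None):
    """Relaxed scalar quantization against one shared (global) codebook.

    z_e:      (batch, n_latents) continuous encoder outputs
    codebook: (K,) scalar values shared by every latent dimension
    Returns the relaxed quantized latents and the soft one-hot assignments.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Squared distance between each latent scalar and each codebook scalar.
    dist = (z_e[..., None] - codebook) ** 2           # (batch, n_latents, K)
    logits = -dist                                     # assumed: closer => more likely
    # Gumbel(0, 1) noise makes the index selection stochastic.
    noisy = (logits + rng.gumbel(size=logits.shape)) / tau
    noisy = noisy - noisy.max(axis=-1, keepdims=True)  # numerical stability
    soft_one_hot = np.exp(noisy) / np.exp(noisy).sum(axis=-1, keepdims=True)
    # Each quantized latent is a convex combination of codebook scalars.
    z_q = soft_one_hot @ codebook                      # (batch, n_latents)
    return z_q, soft_one_hot
```

As `tau` approaches zero the soft assignments approach one-hot vectors, so each entry of `z_q` approaches a single codebook scalar. A per-dimension-codebook variant would instead index a `(n_latents, K)` array, which is the alternative the ablations compare against.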

Paper Structure (17 sections, 13 equations, 9 figures, 5 tables, 1 algorithm)


Figures (9)

  • Figure 1: At the first stage (yellow background), an input $x$ is encoded into a latent representation $z_e(x)$ by the encoder $\mathcal{E}_{\theta_1}$, followed by some nonlinear operations $\mathcal{E}_{\theta_2}$. Each latent variable in $z_e(x)$ is quantized with the colored scalars from the codebook $\mathcal{M}$ whose indices $z$ are sampled based on the distances $\mathcal{R}$ between $z_e(x)$ and $\mathcal{M}$. The quantized latent representation $z_q(x)$ is transformed by nonlinear operations $\mathcal{D}_{\phi_2}$, and fed into the decoder $\mathcal{D}_{\phi_1}$ to reconstruct $x$. The discriminator $\mathcal{C}_\psi$ outputs log probabilities that its input is sampled from $q(z)$ rather than from $\bar{q}(z)$. At the second stage (pink background), a new data batch $x'$ is sampled. The permuter $\mathcal{P}$ permutes the one-hot indices $z'$ across the latent dimensions, and yields $z'_{perm}$ (best viewed in PDF with zoom).
  • Figure 2: Latent traversal on the same image with QLAE and FactorQVAE for the Shapes3D dataset. Each row $i$ (labeled as $\mathbf{z_i}$) shows the result of manipulating the $i^{th}$ latent, with the last 8 columns showing interpolations. For QLAE, the $i^{th}$ latent is intervened on with a linear interpolation between the minimum and maximum values in the corresponding $i^{th}$ codebook; for FactorQVAE, the interpolation runs between the minimum and maximum values in the global codebook. For FactorQVAE, rows 1 and 2 control object hue, row 3 controls object shape, rows 4 and 10 control camera orientation, rows 5 and 11 control floor hue, rows 6 and 8 control wall hue, and rows 7 and 9 control object scale.
  • Figure 3: Latent traversal on the same image with FactorVAE and FactorQVAE for the Isaac3D dataset. Each row $i$ (labeled as $\mathbf{z_i}$) shows the result of manipulating the $i^{th}$ latent, with the last 8 columns showing interpolations. For FactorVAE, the $i^{th}$ latent is intervened on with a linear interpolation between "original latent value - 3" and "original latent value + 3"; for FactorQVAE, the interpolation runs between the minimum and maximum values in the global codebook. For FactorQVAE, rows 1 and 6 control lighting direction, rows 3 and 10 control the robot’s vertical axis, rows 5 and 8 control object scale, rows 9 and 13 control wall color, rows 10 and 14 control object color, row 11 controls the robot’s horizontal axis, row 12 controls lighting intensity, row 16 controls camera height, and row 17 controls object shape.
  • Figure 4: Latent traversal on the same image with $\beta$-VAE and FactorQVAE for the MPI3D dataset. Each row $i$ (labeled as $\mathbf{z_i}$) shows the result of manipulating the $i^{th}$ latent, with the last 8 columns showing interpolations. For $\beta$-VAE, the $i^{th}$ latent is intervened on with a linear interpolation between "original latent value - 3" and "original latent value + 3"; for FactorQVAE, the interpolation runs between the minimum and maximum values in the global codebook. For FactorQVAE, row 2 controls camera height, row 3 controls object color, row 5 controls object size, row 8 controls background color, and row 9 controls object shape.
  • Figure 5: Visualization of the NMI matrix used in the InfoMEC calculation for the Shapes3D dataset.
  • ...and 4 more figures
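
The permuter $\mathcal{P}$ from the Figure 1 caption can be sketched as follows. The caption only states that one-hot indices are permuted across the latent dimensions; this sketch assumes the FactorVAE-style procedure in which each latent dimension is shuffled independently across the batch, so that the result approximates samples from the product of marginals $\bar{q}(z)$. The function name `permute_latents` and the array layout are hypothetical.

```python
import numpy as np

def permute_latents(z, rng=None):
    """Shuffle every latent dimension independently across the batch.

    z: (batch, n_latents, K) one-hot codebook indices.
    Each dimension keeps its own marginal distribution, but dependencies
    between different dimensions of the same sample are broken.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, n_latents = z.shape[:2]
    z_perm = np.empty_like(z)
    for j in range(n_latents):
        # A fresh batch permutation for each latent dimension.
        z_perm[:, j] = z[rng.permutation(batch), j]
    return z_perm
```

Under this assumption, the discriminator $\mathcal{C}_\psi$ would be trained to tell the original `z` from `permute_latents(z)`, and its output would supply the density ratio needed to estimate the total correlation.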