Multiple Invertible and Partial-Equivariant Function for Latent Vector Transformation to Enhance Disentanglement in VAEs

Hee-Jun Jung, Jaehyoung Jeong, Kangil Kim

TL;DR

This work tackles unsupervised disentanglement in VAEs with MIPE-Transformation, a framework with two parts: (i) Invertible and Partial-Equivariant (IPE) Transformations, which map latent vectors through an invertible symmetric matrix exponential that preserves partial input-to-latent equivariance, and (ii) Exponential-Family Conversion (EF-Conversion), which maps the transformed latent variables to flexible, non-Gaussian priors via learnable natural parameters. The loss combines an EF similarity loss, a KL divergence expressed in the EF setting, and a KL calibration term with an implicit semantic mask, so that the model can learn latent distributions that are not known in advance. Integrating multiple IPE units into VAEs improves disentanglement metrics on the dSprites, 3D Shapes, and 3D Cars datasets, with ablations validating the contributions of symmetry, invertibility, and EF-Conversion. The results suggest that MIPE-Transformation is a practical plug-in inductive bias for state-of-the-art VAEs that also provides a flexible prior for latent representations.

Abstract

Disentanglement learning is central to understanding and reusing learned representations in variational autoencoders (VAEs). Although equivariance has been explored in this context, effectively exploiting it for disentanglement remains challenging. In this paper, we propose a novel method, called Multiple Invertible and Partial-Equivariant Transformation (MIPE-Transformation), which integrates two main parts: (1) Invertible and Partial-Equivariant Transformation (IPE-Transformation), guaranteeing an invertible latent-to-transformed-latent mapping while preserving partial input-to-latent equivariance in the transformed latent space; and (2) Exponential-Family Conversion (EF-Conversion) to extend the standard Gaussian prior to an approximate exponential family via a learnable conversion. In experiments on the 3D Cars, 3D Shapes, and dSprites datasets, MIPE-Transformation improves the disentanglement performance of state-of-the-art VAEs.
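
The invertibility claim for the IPE-Transformation can be illustrated with a small sketch. It assumes the latent-to-latent map is $\hat{z} = e^{S} z$ with $S = (W + W^\top)/2$ built from an unconstrained matrix $W$; that parameterization, the helper names (`symmetrize`, `expm`, `ipe_forward`, `ipe_inverse`), and the truncated Taylor series are illustrative choices, not the authors' implementation. Since $e^{S} e^{-S} = I$ for any square matrix, the inverse map is simply $e^{-S}$.

```python
# Minimal sketch (standard library only) of an invertible latent-to-latent map
# built from a symmetric matrix exponential. All names here are hypothetical.

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def symmetrize(w):
    # Assumption: the symmetric matrix is obtained as S = (W + W^T) / 2.
    n = len(w)
    return [[0.5 * (w[i][j] + w[j][i]) for j in range(n)] for i in range(n)]

def negate(m):
    return [[-v for v in row] for row in m]

def expm(m, terms=30):
    # Truncated Taylor series: sum_{k < terms} M^k / k!.
    # Adequate for small, well-scaled matrices; use a library routine otherwise.
    n = len(m)
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def apply(m, z):
    return [sum(m[i][j] * z[j] for j in range(len(z))) for i in range(len(m))]

def ipe_forward(w, z):
    # z_hat = exp(S) z
    return apply(expm(symmetrize(w)), z)

def ipe_inverse(w, z_hat):
    # z = exp(-S) z_hat, because exp(S) exp(-S) = I
    return apply(expm(negate(symmetrize(w))), z_hat)

w = [[0.2, -0.5], [0.3, 0.1]]      # unconstrained parameter matrix (made-up values)
z = [1.0, -2.0]                    # a latent vector (made-up values)
z_hat = ipe_forward(w, z)
z_back = ipe_inverse(w, z_hat)     # recovers z up to floating-point error
```

In practice a numerically robust routine such as `scipy.linalg.expm` would replace the Taylor series; the pure-Python version only keeps the sketch dependency-free.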

Paper Structure

This paper contains 50 sections, 6 theorems, 43 equations, 7 figures, 10 tables.

Key Result

Proposition 4.1

Any $\psi(\cdot) \in G_S$, denoted $\psi_{G_S}(\cdot)$, is equivariant to the group $G_S$.
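
A short worked step shows why a statement of this form can hold. This excerpt does not define $G_S$, so the following is only a sketch under an assumption: that $G_S$ consists of matrix exponentials $e^{A}$ of symmetric matrices that commute pairwise (the title of Lemma B.1, "Abelian Lie subgroup", points in this direction). Take $\psi(z) = e^{A} z$ and a group element $g = e^{B}$ with $AB = BA$. Then

$$
\psi(g \cdot z) = e^{A} e^{B} z = e^{A+B} z = e^{B} e^{A} z = g \cdot \psi(z),
$$

where the middle equalities use $e^{A} e^{B} = e^{A+B}$, which is valid when $A$ and $B$ commute. Applying $\psi$ after the group action therefore equals applying the group action after $\psi$, which is the equivariance property.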

Figures (7)

  • Figure 1: The overall architecture of our proposed MIPET-VAE. The invertible and partial-equivariant function $\psi(\cdot)$ for latent-to-latent (L2L) transformation uses a symmetric matrix exponential so that it is 1) invertible and 2) partial-equivariant. Then 3) the EF conversion module converts the distribution of the unrestricted $\hat{\bm{z}}$ to an EF distribution with the $\mathcal{L}_{el}$ loss. It also applies a KL divergence loss ($\mathcal{L}_{kl}$) between the transformed posterior and prior, both of which are expressed by the probability density function of the EF. Finally, EF conversion reduces the computational error ($\mathcal{L}_{cali}$) between the approximated and true KL divergence.
  • Figure 2: The homogeneous space $\mathcal{Z}^\prime$ is induced by the encoder $q_\phi$, and the cardinality of $\widehat{\mathcal{Z}}^{\,J}$ depends on the latent-to-latent (L2L) transformation.
  • Figure 3: Each square represents a value in the DCI matrix, which describes the relationship between the $i^{th}$ latent dimension and each factor. The size of each square is relative to the values within each row. The ideal case resembles a sparse matrix. The y-axis represents the factors of each dataset, while the x-axis corresponds to the latent vector dimensions. The number shown in each row of the matrix indicates the maximum value and standard deviation of that row. Higher maximum and standard deviation values suggest greater sparsity, indicating closer alignment with the ideal case.
  • Figure 4: Non-Gaussian posterior learned by an IPE module without explicit guidance toward a specific distribution in a toy setting. We compare VAE, MIPET-VAE, and MIPET-VAE without a semantic mask to assess how well each model captures the underlying distribution. We construct the VAE with a 4-layer Multi-Layer Perceptron (MLP) as the encoder and a single linear layer as the decoder. Blue points are randomly sampled from a two-dimensional beta distribution, red points are the posterior, and black points are the output results.
  • Figure 5: Qualitative results on various datasets, which show the factors learned for each dimension of $z$.
  • ...and 2 more figures
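
The Figure 1 caption mentions a KL divergence between a posterior and a prior that are both written as exponential-family (EF) densities. For two members of the same family with natural parameters $\theta_p$ and $\theta_q$ and log-partition function $A$, the KL divergence has the standard closed form $A(\theta_q) - A(\theta_p) - (\theta_q - \theta_p)^\top \nabla A(\theta_p)$. The sketch below checks this identity on a univariate Gaussian, used here only as a familiar stand-in: the paper's EF-Conversion learns natural parameters for an unknown family and works with an approximation, which this example does not attempt to reproduce. All function names are hypothetical.

```python
import math

def natural_params(mu, sigma):
    # Gaussian N(mu, sigma^2) with sufficient statistics T(x) = (x, x^2).
    return (mu / sigma**2, -1.0 / (2.0 * sigma**2))

def log_partition(theta):
    # A(theta) = -theta1^2 / (4 theta2) - 0.5 * log(-2 theta2)
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) - 0.5 * math.log(-2.0 * t2)

def mean_params(theta):
    # Gradient of A: the expected sufficient statistics (E[x], E[x^2]).
    t1, t2 = theta
    mu = -t1 / (2.0 * t2)
    var = -1.0 / (2.0 * t2)
    return (mu, mu**2 + var)

def ef_kl(theta_p, theta_q):
    # KL(p || q) = A(theta_q) - A(theta_p) - (theta_q - theta_p) . grad A(theta_p)
    grad = mean_params(theta_p)
    diff = (theta_q[0] - theta_p[0], theta_q[1] - theta_p[1])
    inner = diff[0] * grad[0] + diff[1] * grad[1]
    return log_partition(theta_q) - log_partition(theta_p) - inner

def gaussian_kl(mu_p, s_p, mu_q, s_q):
    # Textbook closed form for KL(N(mu_p, s_p^2) || N(mu_q, s_q^2)).
    return math.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2.0 * s_q**2) - 0.5

# Made-up example values for a posterior p and a prior q.
kl_from_ef = ef_kl(natural_params(0.5, 1.2), natural_params(-0.3, 0.8))
kl_direct = gaussian_kl(0.5, 1.2, -0.3, 0.8)
```

The two values agree up to floating-point error. When the family is only approximated, the same formula yields an approximate KL, and the gap to the true KL is what a calibration term such as $\mathcal{L}_{cali}$ would need to reduce.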

Theorems & Definitions (7)

  • Proposition 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Lemma B.1: Abelian Lie subgroup
  • Proof: Proof sketch
  • Proposition B.2
  • Proposition B.3