Learning multi-modal generative models with permutation-invariant encoders and tighter variational objectives

Marcel Hirt, Domenico Campolo, Victoria Leong, Juan-Pablo Ortega

TL;DR

The paper introduces a tighter, permutation-invariant variational objective for multi-modal VAEs that can be optimized with masking over modality subsets. By replacing fixed PoE/MoE aggregations with learnable permutation-invariant encoders (e.g., Sum-Pooling and Set Transformer) and introducing a second latent variable to capture cross-modal information, the approach yields tighter lower bounds on the multi-modal log-likelihood and improved identifiability. The authors provide an information-theoretic analysis and demonstrate through extensive experiments on linear and nonlinear models, including MNIST-SVHN-Text, that their method achieves higher log-likelihoods and better latent identifiability than traditional mixture-based bounds, while enabling flexible handling of missing modalities. The work highlights practical benefits for cross-modal generation and representation learning, and outlines avenues for incorporating more expressive priors and diffusion-based techniques in future work.

Abstract

Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research. Multi-modal Variational Autoencoders (VAEs) have been a popular generative model class that learns latent representations that jointly explain multiple modalities. Various objective functions for such models have been suggested, often motivated as lower bounds on the multi-modal data log-likelihood or from information-theoretic considerations. To encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes have been routinely used and shown to yield different trade-offs, for instance, regarding their generative quality or consistency across multiple modalities. In this work, we consider a variational objective that can tightly approximate the data log-likelihood. We develop more flexible aggregation schemes that avoid the inductive biases in PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks. Our numerical experiments illustrate trade-offs for multi-modal variational objectives and various aggregation schemes. We show that our variational objective and more flexible aggregation models can become beneficial when one wants to approximate the true joint distribution over observed modalities and latent variables in identifiable models.

Paper Structure

This paper contains 69 sections, 5 theorems, 120 equations, 11 figures, 20 tables, 2 algorithms.

Key Result

Proposition 1

For any $\mathcal{S} \in \mathcal{P}(\mathcal{M})$, we have where $q^\star(x_\mathcal{S}|z) = q_\phi(x_\mathcal{S},z)/q_{\phi}^{\text{agg}}(z)$. Moreover, for fixed $x_{\mathcal{S}}$, where $q^\star(x_{\setminus \mathcal{S}}|z,x_\mathcal{S}) = q_{\phi}(z,x_{\setminus \mathcal{S}}|x_\mathcal{S})/ q^{\text{agg}}_{\phi,\setminus \mathcal{S}}(z|x_\mathcal{S}) = p_d(x_{\setminus \mathcal{S}} | x_\mat

Figures (11)

  • Figure 1: Reconstruction or cross-prediction of modalities for a mixture-based bound (\ref{fig:recon_mixture}) and for our objective (\ref{fig:recon_masked}). The mixture-based bound resorts to a single latent variable $Z \sim q_\phi(\cdot| x_\mathcal{S})$ that encodes information from a modality subset $x_\mathcal{S}$ and is trained to reconstruct the conditioning modalities $x_\mathcal{S}$, as well as to predict the masked modalities $x_{\setminus \mathcal{S}}$. Our objective relies on two latent variables $Z_\mathcal{S} \sim q_\phi(\cdot| x_\mathcal{S})$ and $Z_\mathcal{M} \sim q_\phi(\cdot| x_\mathcal{S}, x_{\setminus \mathcal{S}})$, where $Z_\mathcal{S}$ is learned to reconstruct all its conditioning modalities, with $Z_\mathcal{M}$ learned to reconstruct the remaining modalities. KL regularization terms for a mixture-based bound (\ref{fig:reg_mixture}) and for our objective (\ref{fig:reg_masked}). The mixture-based bound aims to minimize the KL divergence between the encoding distribution given a modality subset $x_{\mathcal{S}}$ and a prior distribution. Our objective additionally aims to minimize the KL divergence between the encoding distribution given all modalities relative to the encoding distribution of a modality subset $x_{\mathcal{S}}$.
  • Figure 2: Illustration of multi-modal aggregation schemes. All encoding schemes first apply modality-specific encoders to each individual modality. A PoE model (\ref{fig:poe}) aggregates the outputs from the modality-specific encoders into a single Gaussian distribution that results from a multiplication of the corresponding uni-modal Gaussian densities. An MoE model (\ref{fig:moe}) assumes an equally weighted Gaussian mixture distribution comprised of the uni-modal Gaussian densities. Our new aggregation schemes allow for learning permutation-invariant fusion models: A Sum-Pooling or Deep Set model (\ref{fig:sum}) applies the same function $g$ to the encoded features $h_s$, $s \in \{T,A,I\}$, before summing them up and using a non-linear projection $\rho$ to the parameters of a Gaussian distribution. A Self-Attention model (\ref{fig:selfattention}) differs from the Sum-Pooling approach by applying self-attention layers or transformer layers before summing up the features, thereby accounting for pairwise interactions between the encoded modalities. Our newly introduced schemes allow for encoding only a modality subset by using standard masking operations.
  • Figure 3: Continuous data modality in (a) and reconstructions using different bounds and fusion models in (b)-(e). The true latent variables are shown in (f), with the inferred latent variables in (g)-(j) with a linear transformation indeterminacy. Labels are color-coded.
  • Figure 4: Conditional generation for different aggregation schemes and bounds and shared latent variables. The first column is the conditioned modality. The next three columns are the generated modalities using a SumPooling aggregation, followed by the three columns for a SelfAttention aggregation, followed by PoE+, and lastly MoE+.
  • Figure 5: Rate and distortion terms for MNIST-SVHN-Text with shared latent variables ($\beta=1$) for our proposed objective ('Masked') and the 'Mixture' based bound.
  • ...and 6 more figures
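
The PoE and Sum-Pooling aggregation schemes described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the latent is scalar, the names `poe_gaussian`, `sum_pooling`, `g`, and `rho` are hypothetical, the standard-normal prior expert in the PoE is an assumption borrowed from common practice, and `g` and `rho` are fixed toy functions standing in for learned networks.

```python
import math


def poe_gaussian(mus, variances, mask, prior_mu=0.0, prior_var=1.0):
    """Product of scalar Gaussian experts over the modalities kept by `mask`.

    Assumption: a N(prior_mu, prior_var) prior expert is always included,
    which also keeps the result well defined when every modality is masked.
    """
    precision = 1.0 / prior_var
    weighted_mean = prior_mu / prior_var
    for mu, var, keep in zip(mus, variances, mask):
        if keep:
            precision += 1.0 / var       # precisions of Gaussian experts add up
            weighted_mean += mu / var    # means are weighted by their precision
    return weighted_mean / precision, 1.0 / precision


def sum_pooling(features, mask, g, rho):
    """Deep Set style fusion: rho(sum over kept modalities of g(h_s))."""
    pooled = None
    for h, keep in zip(features, mask):
        if not keep:
            continue                     # masked modalities contribute nothing
        gh = g(h)
        pooled = gh if pooled is None else [p + v for p, v in zip(pooled, gh)]
    if pooled is None:                   # all modalities masked: pool zeros
        pooled = [0.0] * len(g(features[0]))
    return rho(pooled)


def g(h):
    """Toy stand-in for the shared feature map applied to every modality."""
    return [math.tanh(x) for x in h]


def rho(v):
    """Toy stand-in for the projection to Gaussian parameters (mean, variance)."""
    mean = v[0]
    variance = math.log1p(math.exp(v[1]))  # softplus keeps the variance positive
    return mean, variance


if __name__ == "__main__":
    # Encoded features h_s for three modalities, e.g. s in {T, A, I}.
    hs = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9]]
    mask = [True, False, True]           # encode only the subset {T, I}

    print(poe_gaussian([1.0, 3.0], [1.0, 1.0], [True, False]))
    print(sum_pooling(hs, mask, g, rho))
    # Reordering the modalities (with their mask entries) gives the same result.
    print(sum_pooling(hs[::-1], mask[::-1], g, rho))
```

Because the pooled quantity is a sum over the kept modalities, the output does not depend on the order in which modalities are presented, and masking reduces to skipping terms in that sum. A Self-Attention variant would insert attention layers between `g` and the sum; it is omitted here to keep the sketch short.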

Theorems & Definitions (31)

  • Proposition 1: Marginal and conditional distribution matching
  • Corollary 2: Multi-modal log-likelihood approximation
  • Remark 3: Log-Likelihood approximation and Empirical Bayes
  • Remark 4: Prior-hole problem and Bayes or conditional consistency
  • Remark 5: Variational gap for mixture-based bounds
  • Remark 6: Optimization, multi-task learning and the choice of $\rho$
  • Lemma 7: Variational bounds on the conditional mutual information
  • Corollary 8: Lagrangian relaxation
  • Remark 9: Mixture-based variational bound
  • Remark 10: Optimal variational distributions
  • ...and 21 more