Multimodal Variational Autoencoder: a Barycentric View

Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Xiaotong Sun, Jin Yang, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras

TL;DR

The paper reframes multimodal VAE aggregation as a barycenter problem, enabling principled comparisons among PoE, MoE, and new approaches. It introduces WB-VAE, which uses the 2-Wasserstein barycenter to preserve geometry across unimodal posteriors, and MWB-VAE, a mixture of Wasserstein barycenters that balances zero-forcing and mass-covering behavior. Theoretical analysis shows that PoE and MoE correspond to reverse and forward KL barycenters, respectively, while the Wasserstein formulation yields a valid, geometry-aware joint posterior that is especially tractable under Gaussian assumptions. Experiments on PolyMNIST, MNIST-SVHN-TEXT, and CelebA show that WB-VAE and MWB-VAE achieve strong latent representations and generation coherence, with MWB-VAE often outperforming state-of-the-art alternatives in challenging multimodal settings.

Abstract

Multiple signal modalities, such as vision and sound, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoders (VAEs), for multimodal representation learning, especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAEs through the lens of barycenters. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which, compared to KL divergence, better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations. Empirical studies on three multimodal benchmarks demonstrate the effectiveness of the proposed method.
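
To make the three aggregation rules in the abstract concrete, the following is a minimal sketch for 1-dimensional Gaussian unimodal posteriors. It is not the authors' implementation: the function names, the weights, and the example values are hypothetical. It assumes the PoE is taken as the normalized weighted geometric mean (weights summing to one) and relies on the standard closed form for the 2-Wasserstein barycenter of 1-D Gaussians, which averages means and standard deviations. Because a mixture of Gaussians is not itself Gaussian, the MoE function only reports the mixture's mean and standard deviation.

```python
import math

def poe_gaussian(mus, sigmas, weights):
    """Weighted product of 1-D Gaussian experts (normalized geometric mean).

    Assumption: weights sum to one. The result is Gaussian with
    precision sum_m w_m / sigma_m^2.
    """
    precisions = [w / s**2 for w, s in zip(weights, sigmas)]
    total = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total
    return mu, math.sqrt(1.0 / total)

def moe_moments(mus, sigmas, weights):
    """Mean and std of the mixture sum_m w_m N(mu_m, sigma_m^2)."""
    mu = sum(w * m for w, m in zip(weights, mus))
    second_moment = sum(w * (s**2 + m**2)
                        for w, m, s in zip(weights, mus, sigmas))
    return mu, math.sqrt(second_moment - mu**2)

def w2_barycenter_gaussian(mus, sigmas, weights):
    """2-Wasserstein barycenter of 1-D Gaussians.

    In one dimension the barycenter is Gaussian, with the weighted
    average of the means and of the standard deviations.
    """
    mu = sum(w * m for w, m in zip(weights, mus))
    sigma = sum(w * s for w, s in zip(weights, sigmas))
    return mu, sigma

if __name__ == "__main__":
    # Hypothetical example: two modalities (M = 2) with equal weights.
    mus, sigmas, weights = [-2.0, 2.0], [1.0, 2.0], [0.5, 0.5]
    print("PoE:", poe_gaussian(mus, sigmas, weights))
    print("MoE moments:", moe_moments(mus, sigmas, weights))
    print("W2 barycenter:", w2_barycenter_gaussian(mus, sigmas, weights))
```

In this toy setting the PoE is pulled toward the sharper expert, the MoE spreads mass over both experts, and the Wasserstein barycenter interpolates both location and scale, which is one way to read the "geometry-preserving" claim.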

Paper Structure (36 sections, 3 theorems, 26 equations, 12 figures, 4 tables)


Key Result

Lemma 1

In the context of multimodal VAE, we seek to find a barycenter $\Tilde{q}(\bm{z} | \bm{X}_{1:M})$ that can aggregate the unimodal inference distributions $\{q_{\phi_m}(\bm{z}|\bm{x}_m)\}_{m=1}^M$ to approximate the true joint posterior $p_{\theta}(\bm{z} | \bm{X}_{1:M})$:

Figures (12)

  • Figure 1: The overview of a multimodal VAE that takes $M$ modalities $\bm{X}_{1:M}=\{ \bm{x}_{j} \}_{j=1}^M$ as input and outputs the reconstructed input modalities $\Tilde{\bm{X}}_{1:M}=\{ \Tilde{\bm{x}}_{j}\}_{j=1}^M$. The multimodal VAE consists of $M$ probabilistic encoders $\{q_{\phi_j}(\bm{z}|\bm{x}_j)\}_{j=1}^{M}$ and decoders $\{p_{\theta_j}(\bm{x}_j|\bm{z})\}_{j=1}^{M}$.
  • Figure 2: Comparison of methods for aggregating the unimodal inference distributions ($\{q_{\phi_j}\}_{j=1}^M$) to approximate the joint posterior ($\Tilde{q}_\phi$): (a) PoE, (b) MoE, and (c) the proposed Wasserstein barycenter. In this illustrative example, we use two 1-dimensional Gaussian modalities ($M=2$) for a proof of concept.
  • Figure 3: Quantitative results on PolyMNIST as a function of the number of input modalities, averaged over all subsets of modalities of the respective size. Left: Linear classification accuracy of digits given the latent representation. Center: Coherence of conditionally generated samples that do not include input modalities. Right: Log-Likelihood of all generated modalities.
  • Figure 4: Conditionally generated images given the text on top of each column on bimodal CelebA using $\mathcal{MWB}$-VAE.
  • Figure S5: Conditionally generated samples of the first modality (from the second to the last rows) given the respective test example from the third modality (first row). For each column, we draw distinct samples from the approximate joint posterior, which should generate the same digits but be expected to show stylistic variations.
  • ...and 7 more figures

Theorems & Definitions (9)

  • Lemma 1
  • Proposition 1
  • Theorem 1
  • Remark 1
  • Remark 2
  • Remark 3
  • proof
  • proof
  • proof