Unity by Diversity: Improved Representation Learning in Multimodal VAEs

Thomas M. Sutter, Yang Meng, Andrea Agostini, Daphné Chopard, Norbert Fortin, Julia E. Vogt, Babak Shahbaba, Stephan Mandt

TL;DR

This work proposes a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior, which results in a superior latent representation and allows each encoding to preserve information better from its uncompressed original features.

Abstract

Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to preserve information better from its uncompressed original features. In extensive experiments on multiple benchmark datasets and two challenging real-world datasets, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.
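The soft constraint described above can be illustrated with a minimal NumPy sketch. It assumes diagonal-Gaussian unimodal posteriors and a uniform mixture over them as the data-dependent prior, and it estimates the per-modality KL term by Monte Carlo sampling. The function names (`gaussian_log_pdf`, `mixture_prior_log_pdf`, `soft_alignment_penalty`) and the choice of KL estimator are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def gaussian_log_pdf(z, mu, log_var):
    """Log density of a diagonal Gaussian, summed over the latent dimensions."""
    return -0.5 * np.sum(
        log_var + np.log(2.0 * np.pi) + (z - mu) ** 2 / np.exp(log_var), axis=-1
    )


def mixture_prior_log_pdf(z, mus, log_vars):
    """Log density of a uniform mixture of the M unimodal posteriors.

    z: (n, D) latent samples; mus, log_vars: (M, D) posterior parameters.
    """
    n_modalities = mus.shape[0]
    comp = np.stack(
        [gaussian_log_pdf(z, mus[m], log_vars[m]) for m in range(n_modalities)],
        axis=0,
    )  # shape (M, n)
    # Numerically stable log-sum-exp over the mixture components.
    mx = comp.max(axis=0)
    return mx + np.log(np.exp(comp - mx).sum(axis=0)) - np.log(n_modalities)


def soft_alignment_penalty(mus, log_vars, n_samples=64, rng=None):
    """Monte Carlo estimate of sum_m KL(q_m || mixture prior).

    Assumption: this KL form stands in for the regularizer; it is a sketch only.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_modalities, dim = mus.shape
    total = 0.0
    for m in range(n_modalities):
        # Reparameterized samples from the m-th unimodal posterior.
        eps = rng.standard_normal((n_samples, dim))
        z = mus[m] + np.exp(0.5 * log_vars[m]) * eps
        log_q = gaussian_log_pdf(z, mus[m], log_vars[m])
        log_h = mixture_prior_log_pdf(z, mus, log_vars)
        total += float(np.mean(log_q - log_h))
    return total


# Two modalities with identical posteriors vs. two far-apart posteriors.
same = soft_alignment_penalty(np.zeros((2, 2)), np.zeros((2, 2)))
far = soft_alignment_penalty(np.array([[0.0, 0.0], [10.0, 10.0]]), np.zeros((2, 2)))
```

In this sketch the penalty vanishes when the unimodal posteriors coincide and grows as they drift apart, but it is bounded by $M \log M$ because each posterior is itself a component of the mixture. That boundedness is one way to read "soft": disagreeing encodings are discouraged rather than forced to be identical.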

Paper Structure (43 sections, 1 theorem, 16 equations, 20 figures, 3 tables)

Key Result

Lemma 4.1

The expectation on the right-hand side of Eq. \ref{eq:ELBO-1} is maximized when, for each $m \in \{1, \ldots, M\}$, the prior $h(\bm{z}_m \mid \bm{X})$ is equal to the aggregated posterior of a multimodal sample given on the first line of Eq. \ref{eq:mm_vamp_prior}.
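The referenced equations are not reproduced on this page. Under the assumption that the "aggregated posterior of a multimodal sample" is the uniform mixture of the $M$ unimodal posterior approximations, the optimal prior in the lemma would take the following form, where the encoder notation $q(\cdot \mid \bm{x}_{\tilde{m}})$ is an assumed symbol rather than one defined above:

```latex
h(\bm{z}_m \mid \bm{X}) = \frac{1}{M} \sum_{\tilde{m}=1}^{M} q(\bm{z}_m \mid \bm{x}_{\tilde{m}})
```

Read this way, the prior for each modality's latent variable is a mixture of experts, with one expert per modality of the same sample $\bm{X}$.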

Figures (20)

  • Figure 1: Independent VAEs (\ref{fig:exp_arch_ind_vaes}) provide reconstructions for individual modalities but lack information sharing across modalities. Multimodal VAEs with joint posterior approximation (\ref{fig:exp_arch_agg_vaes}) aggregate unimodal posteriors into a joint posterior but may incur poor reconstruction quality. Our proposed MMVM VAE (\ref{fig:exp_arch_mmvamp_vaes}) enhances independent VAEs with a data-dependent prior, $h(\bm{z} \mid \bm{X})$, allowing soft sharing of information between modalities while preserving modality-specific reconstructions.
  • Figure 2: Results on the benchmark datasets translated PolyMNIST, bimodal CelebA, and CUB. An optimal model would be in the top right corner, with low reconstruction error and high classification performance. The proposed MMVM method achieves either a higher score, in latent representation classification (LR, \ref{fig:exp_benchmarks_downstream_polymnist}, \ref{fig:exp_benchmarks_downstream_celeba}, \ref{fig:exp_benchmarks_downstream_cub}) or coherence of generated samples (Coh, \ref{fig:exp_benchmarks_coherence_polymnist}, \ref{fig:exp_benchmarks_coherence_celeba}, \ref{fig:exp_benchmarks_coherence_cub}), at the same reconstruction loss, or the same score at a lower reconstruction loss. Every point averages runs over multiple seeds for a specific $\beta$ value (see \ref{sec:exp_benchmarks}).
  • Figure 3: Results of a memory experiment conducted on five rats, each regarded as a separate modality. We report latent representation classification performance and conditional generation coherence against reconstruction loss for the different VAE methods. Every point in the figures represents a specific $\beta$ value, where $\beta \in \{10^{-5}, 10^{-4}, 10^{-3}, 2.5\times 10^{-3}, 5\times 10^{-3}, 10^{-2}\}$. An optimal model would be in the top right corner.
  • Figure 4: Latent neural representation during a memory experiment. Each model is evaluated at its own optimal $\beta$ value (0.00001, 0.01, 0.00001, and 0.001 for independent, AVG, MoPoE, and MMVM, respectively), chosen by unimodal latent representation classification accuracy according to \ref{fig:exp_rats_downstream_recloss}. Our method distinguishes the odor stimuli in the latent space with a clear separation of odors (4 different colors), similar to the MoPoE VAE. Conversely, the unimodal and AVG models fail to combine multiple views, as the odor separation only occurs within single views.
  • Figure 5: We compare the achieved values of the proposed objective $\mathcal{E}$ to the negative mean squared error (MSE) of a vanilla autoencoder (AE). As the $\beta$ value of the regularizer $R$ in the objective (see \ref{sec:method}) is lowered, $\mathcal{E}$ approaches the negative MSE bound provided by the vanilla AE. This provides empirical evidence that the negative MSE of the vanilla AE upper bounds the proposed objective $\mathcal{E}$.
  • ...and 15 more figures

Theorems & Definitions (2)

  • Lemma 4.1
  • proof