Score-Based Multimodal Autoencoder

Daniel Wesego, Pedram Rooshenas

TL;DR

This work tackles the challenge of coherently and efficiently generating multiple modalities by decoupling unimodal representation learning from multimodal fusion. It introduces a score-based model that learns the joint latent space of independently trained unimodal VAEs, combined with two coherence-guidance mechanisms (energy-based and contrastive) to ensure cross-modal alignment during inference. The approach achieves strong unconditional coherence and competitive conditional generation on PolyMNIST-based and high-dimensional datasets, and demonstrates robustness to black-box adversarial perturbations. While the approach improves coherence and scales to many modalities, the paper notes the computational cost of diffusion-based sampling and outlines directions for faster inference and broader guidance strategies.

Abstract

Multimodal Variational Autoencoders (VAEs) represent a promising group of generative models that facilitate the construction of a tractable posterior within the latent space given multiple modalities. Previous studies have shown that as the number of modalities increases, the generative quality of each modality declines. In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of independently trained unimodal VAEs using score-based models (SBMs). The role of the SBM is to enforce multimodal coherence by learning the correlation among the latent variables. Consequently, our model combines the better generative quality of unimodal VAEs with coherent integration across different modalities through the latent score-based model. In addition, our approach provides the best unconditional coherence.
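
To make the idea of a score model over the joint latent space concrete, the following is a minimal sketch and not the authors' implementation. It assumes the per-modality latents are simply concatenated, and it uses a Gaussian perturbation with a single fixed noise level `sigma` together with the standard denoising score matching loss; the noise schedule, SDE, and network used in the paper may differ. The names `joint_latent`, `dsm_loss`, and `score_fn` are hypothetical.

```python
import numpy as np


def joint_latent(latents):
    """Concatenate per-modality latent batches (each of shape [batch, d_m])
    into one joint latent of shape [batch, sum of d_m].
    Assumption: plain concatenation is used to form the joint latent."""
    return np.concatenate(latents, axis=-1)


def dsm_loss(score_fn, z0, sigma, rng):
    """Denoising score matching loss at a single noise level (a sketch).

    z0       : clean joint latents, shape [batch, dim]
    sigma    : standard deviation of the Gaussian perturbation (assumed fixed)
    score_fn : hypothetical callable (z_noisy, sigma) -> estimated score
    """
    noise = rng.standard_normal(z0.shape)
    z_noisy = z0 + sigma * noise
    # Score of the Gaussian perturbation kernel N(z_noisy; z0, sigma^2 I).
    target = -noise / sigma
    pred = score_fn(z_noisy, sigma)
    # Weight by sigma^2 so the loss scale does not depend on the noise level.
    per_sample = np.sum((sigma * (pred - target)) ** 2, axis=-1)
    return float(np.mean(per_sample))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for latents produced by two independently trained unimodal encoders.
    z_image = rng.standard_normal((8, 4))
    z_text = rng.standard_normal((8, 3))
    z = joint_latent([z_image, z_text])
    # A placeholder "score network" that just pulls samples toward the origin.
    print(z.shape, dsm_loss(lambda zt, s: -zt / (1.0 + s ** 2), z, 0.5, rng))
```

In this sketch the unimodal autoencoders are never updated: only `score_fn` would be trained, which mirrors the decoupling of unimodal representation learning from multimodal fusion described above.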

Paper Structure (31 sections, 11 equations, 39 figures, 11 tables, 2 algorithms)

Figures (39)

  • Figure 1: A variational or regularized auto-encoder is used for each individual modality to obtain its latent representation. These representations are then used to train the score-based model, which allows the prediction of any modality given some or none of the others. The auto-encoders are trained independently in the first stage, and the resulting $z$ of each modality is used to train the score network.
  • Figure 2: Extended PolyMNIST dataset.
  • Figure 3: Multiple conditionally generated samples for each digit from the third modality. Each column shows samples, from 0 to 9, generated conditionally given the remaining modalities.
  • Figure 4: Unconditional samples from 10 modalities using SBM-VAE. Columns represent different samples from each modality.
  • Figure 5: Unconditional coherence. The x-axis shows the number of coherent modalities, and the y-axis shows the percentage of generated outputs in which that many predicted modalities are coherent.
  • ...and 34 more figures
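
The unconditional coherence plot described in the Figure 5 caption can be read as a histogram over generated samples. The sketch below is one plausible way to compute it and is not taken from the paper: it assumes each generated modality has already been classified by some pretrained classifier, and it counts, per sample, the size of the largest group of modalities that agree on a label. The function name `coherence_histogram` and the agreement rule are assumptions.

```python
from collections import Counter


def coherence_histogram(predicted_labels):
    """Percentage of generated samples per number of coherent modalities.

    predicted_labels: list of samples; each sample is a list holding the
    predicted class of every generated modality (assumed to come from
    pretrained per-modality classifiers).
    Returns a dict {k: percentage of samples where exactly k modalities
    share the most common label}, for k = 1..number of modalities.
    """
    if not predicted_labels:
        raise ValueError("need at least one generated sample")
    num_modalities = len(predicted_labels[0])
    counts = Counter()
    for labels in predicted_labels:
        if len(labels) != num_modalities:
            raise ValueError("every sample must cover all modalities")
        # Size of the largest group of modalities agreeing on one label.
        largest_group = Counter(labels).most_common(1)[0][1]
        counts[largest_group] += 1
    total = len(predicted_labels)
    return {k: 100.0 * counts[k] / total for k in range(1, num_modalities + 1)}


if __name__ == "__main__":
    # Four generated samples, three modalities each (toy labels).
    samples = [[3, 3, 3], [1, 1, 2], [0, 1, 2], [5, 5, 5]]
    print(coherence_histogram(samples))  # {1: 25.0, 2: 25.0, 3: 50.0}
```

Under this reading, a fully coherent model would put all of its mass on the rightmost bar, where every modality depicts the same class.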