Learning Multimodal Latent Generative Models with Energy-Based Prior

Shiyu Yuan; Jiali Cui; Hanao Li; Tian Han

TL;DR

This paper proposes a novel framework that integrates a multimodal latent generative model with an energy-based model (EBM), yielding a more expressive and informative prior that better captures information across multiple modalities.

Abstract

Multimodal generative models have recently gained significant attention for their ability to learn representations across various modalities, enhancing joint and cross-generation coherence. However, most existing works use standard Gaussian or Laplacian distributions as priors, which may struggle to capture the diverse information inherent in multiple data types because they are unimodal and relatively uninformative. Energy-based models (EBMs), known for their expressiveness and flexibility across various tasks, have yet to be thoroughly explored in the context of multimodal generative models. In this paper, we propose a novel framework that integrates a multimodal latent generative model with an EBM. Both models can be trained jointly through a variational scheme. This approach yields a more expressive and informative prior that better captures information across multiple modalities. Our experiments validate the proposed model, demonstrating its superior generation coherence.
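
To make the contrast between a unimodal prior and an energy-based prior concrete, the following is a minimal one-dimensional sketch and not the paper's implementation. It assumes the prior is an exponential tilting of a standard Gaussian, p(z) ∝ exp(f(z)) N(z; 0, 1), and draws samples with unadjusted Langevin dynamics. The function `neg_energy`, its double-well form, the step size, and the chain length are all hypothetical choices made for illustration; a real model would use a learned network and a higher-dimensional latent.

```python
import math
import random


def neg_energy(z: float) -> float:
    """Hypothetical toy f(z); a real model would use a learned network here."""
    return -((z * z - 4.0) ** 2) / 4.0


def log_prior_unnorm(z: float) -> float:
    """Unnormalized log-density: f(z) plus the log of a standard Gaussian
    base distribution (up to an additive constant)."""
    return neg_energy(z) - 0.5 * z * z


def grad_log_prior(z: float) -> float:
    """Analytic derivative of log_prior_unnorm with respect to z."""
    return -(z * z - 4.0) * z - z


def langevin_sample(n_chains=200, n_steps=400, step_size=0.1, seed=0):
    """Run independent Langevin chains, each started from the Gaussian base.

    Update rule: z <- z + (s^2 / 2) * grad log p(z) + s * noise,
    where s is the step size and noise is standard normal.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_chains):
        z = rng.gauss(0.0, 1.0)  # initial state from the unimodal base
        for _ in range(n_steps):
            noise = rng.gauss(0.0, 1.0)
            z = z + 0.5 * step_size ** 2 * grad_log_prior(z) + step_size * noise
        samples.append(z)
    return samples


if __name__ == "__main__":
    zs = langevin_sample()
    near_modes = sum(abs(z) > 1.0 for z in zs) / len(zs)
    print(f"fraction of samples away from the origin: {near_modes:.2f}")
```

With this toy energy, the tilted density has two modes near z = ±√3 and low mass around z = 0, whereas the Gaussian base alone concentrates its mass at the origin. The chains therefore start from a unimodal distribution and drift toward a bimodal one.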

Paper Structure (11 sections, 14 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Methodology
Energy-based Prior for Multimodalities
Learning and Sampling
Experiments
Joint Coherence
Cross Coherence
Model Generalization
Ablation Studies
Conclusions and Future Work

Figures (5)

Figure 1: Qualitative Results of Joint Generation on MNIST-SVHN (top) and PolyMNIST (bottom)
Figure 2: Qualitative results of Joint Generation on PolyMNIST
Figure 3: Qualitative results of Cross Generation on PolyMNIST. From left to right, cross generation results are from (a) EBM prior, (b) MMVAE+, (c) MoPoE, (d) mmJSD, and (e) MVTCAE.
Figure 4: Visualization of Markov chain transition on image synthesis and corresponding generated text. Left: visualization of the Markov chain transition, where the left-column images are generated from a unimodal prior sample and the right-column images are generated from an EBM prior sample. Right: jointly generated text corresponding to each row in the left figure. The sentence at the top is generated from a unimodal prior sample, and the sentence at the bottom is generated from an EBM prior sample.
Figure : Qualitative Results of Joint Generation on MNIST-SVHN (top) and PolyMNIST (bottom)
