Joint Multimodal Learning with Deep Generative Models

Masahiro Suzuki, Kotaro Nakayama, Yutaka Matsuo

TL;DR

The paper tackles the challenge of bidirectional cross-modal generation by learning a joint latent representation that couples multiple modalities through a joint multimodal variational autoencoder (JMVAE). It introduces JMVAE, which factors $p(\mathbf{x},\mathbf{w})$ via independent modality decoders conditioned on a shared latent variable, and extends it with JMVAE-kl to prevent sample collapse when modalities are missing by aligning single-modality encoders with the joint encoder, related to variation of information. Empirically, JMVAE achieves strong joint representations and competitive or superior log-likelihoods on MNIST and CelebA, with a GAN-enhanced variant (JMVAE-GAN) yielding sharper image generation on CelebA. The work demonstrates robust, bidirectional cross-modal generation between images and attributes and suggests scalable extensions to more modalities and richer connections based on variation of information (VI). Overall, it advances multimodal deep generative modeling by enabling joint learning and robust inference across heterogeneous data sources.
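
The factorization mentioned above can be written out explicitly. The following is a sketch under assumptions: the parameter subscripts $\theta_{\mathbf{x}}$, $\theta_{\mathbf{w}}$, and $\phi$ are notation introduced here for illustration, and the bound shown is the standard variational lower bound for a model of this shape rather than a quotation from the paper.

```latex
% Joint distribution: both modalities are conditionally independent given z.
p(\mathbf{x}, \mathbf{w})
  = \int p_{\theta_{\mathbf{x}}}(\mathbf{x} \mid \mathbf{z})\,
         p_{\theta_{\mathbf{w}}}(\mathbf{w} \mid \mathbf{z})\,
         p(\mathbf{z})\, d\mathbf{z}

% Variational lower bound with a joint encoder q_phi(z | x, w).
\log p(\mathbf{x}, \mathbf{w})
  \ge \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{w})}
        \big[ \log p_{\theta_{\mathbf{x}}}(\mathbf{x} \mid \mathbf{z}) \big]
    + \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{w})}
        \big[ \log p_{\theta_{\mathbf{w}}}(\mathbf{w} \mid \mathbf{z}) \big]
    - D_{\mathrm{KL}}\big( q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{w})
        \,\|\, p(\mathbf{z}) \big)
```

Because the two reconstruction terms share the same $\mathbf{z}$, either modality can in principle be decoded from a latent code inferred from the other, which is what makes the generation bidirectional.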

Abstract

We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. As described herein, we propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on joint representation. In other words, it models a joint distribution of modalities. Furthermore, to be able to generate missing modalities from the remaining modalities properly, we develop an additional method, JMVAE-kl, that is trained by reducing the divergence between JMVAE's encoder and prepared networks of respective modalities. Our experiments show that our proposed method can obtain appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that JMVAE can generate multiple modalities bi-directionally.
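
To make the JMVAE-kl idea concrete, here is a minimal sketch of the divergence penalty between the joint encoder and the single-modality encoders. It assumes that every encoder outputs a diagonal Gaussian (mean and standard deviation per latent dimension) and that the penalty is the sum of two KL terms scaled by a weight `alpha`. The function names, the `alpha` parameter, and the direction of the KL terms are assumptions made for this illustration, not details stated in the abstract.

```python
import math
from typing import Sequence, Tuple

Gaussian = Tuple[Sequence[float], Sequence[float]]  # (means, standard deviations)


def gaussian_kl(mu_q: Sequence[float], sigma_q: Sequence[float],
                mu_p: Sequence[float], sigma_p: Sequence[float]) -> float:
    """Closed-form KL(q || p) between two diagonal Gaussians."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
                  - 0.5)
    return total


def jmvae_kl_penalty(joint: Gaussian, image_only: Gaussian,
                     attr_only: Gaussian, alpha: float) -> float:
    """Hypothetical penalty: alpha * [KL(q(z|x,w) || q(z|x)) + KL(q(z|x,w) || q(z|w))]."""
    mu, sigma = joint
    return alpha * (gaussian_kl(mu, sigma, *image_only)
                    + gaussian_kl(mu, sigma, *attr_only))


# Toy encoder outputs for one example with a 2-D latent space (made-up numbers).
joint = ([0.0, 1.0], [1.0, 0.5])
image_only = ([0.1, 0.9], [1.0, 0.5])
attr_only = ([0.0, 1.0], [1.2, 0.6])
penalty = jmvae_kl_penalty(joint, image_only, attr_only, alpha=0.1)
```

In a training loop, this penalty would be subtracted from the joint lower bound, so that the single-modality encoders are pulled toward the joint encoder and can stand in for it when one modality is missing.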

Paper Structure

This paper contains 23 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Various images and attributes generated from an input image. We used the CelebA dataset (Eyebrows2015) to train and test models in this example. Each yellow box corresponds to a different process. All processes are estimated from a single generative model: the joint multimodal variational autoencoder (JMVAE), which is our proposed model.
  • Figure 2: (a) Graphical model of the JMVAE. Gray circles represent observed variables. The white circle denotes a latent variable. (b) Two approaches to estimating encoders with a single input, $q(\mathbf{z}|\mathbf{x})$ and $q(\mathbf{z}|\mathbf{w})$, on the JMVAE: left, treat every modality other than the input modality as missing (JMVAE-zero); right, prepare encoders that have a single input and make them close to the JMVAE encoder (JMVAE-kl).
  • Figure 3: Visualizations of 2-D latent representation. The network architectures are the same as those in Section \ref{sec:quan}, except that the dimension of the top hidden layer is set to 2. Points with different colors correspond to the digit labels. These were sampled from $q(\mathbf{z}|\mathbf{x})$ in the VAE and $q(\mathbf{z}|\mathbf{x}, \mathbf{w})$ in both the CVAE and JMVAE. We used JMVAE-zero as the JMVAE.
  • Figure 4: (a) Generation of average faces and corresponding random faces. We first set all attribute values randomly in $\{-1, 1\}$ and designate the result as Base. Then, we choose an attribute that we want to set (e.g., Male, Bald, Smiling) and change its value in Base to $2$ (or $-2$ if we want to set "Not"). Each column corresponds to the same attribute, according to the legend. Average faces are generated from $p(\mathbf{x}|\mathbf{z}_{mean})$, where $\mathbf{z}_{mean}$ is the mean of $q(\mathbf{z}|\mathbf{w})$. Moreover, we can obtain various images conditioned on the same attribute values by sampling $\mathbf{x}\sim p(\mathbf{x}|\mathbf{z})$, where $\mathbf{z}=\mathbf{z}_{mean}+{\boldsymbol \sigma}\odot{\boldsymbol \epsilon}$, ${\boldsymbol \epsilon}\sim \mathcal{N}(\mathbf{0},{\boldsymbol \zeta})$, and ${\boldsymbol \zeta}$ is the parameter that determines the range of variance. In this figure, we set ${\boldsymbol \zeta}=0.6$. Each row of random faces shares the same ${\boldsymbol \epsilon}$. (b) PCA visualizations of latent representation. Colors indicate which attribute each sample is conditioned on.
  • Figure 5: Portraits of the Mona Lisa (upper) and Mozart (lower), their generated attributes, and reconstructed images conditioned on varied attributes, according to the legend. We cropped and resized the portraits in the same way as CelebA. The procedure is as follows: generate the corresponding attributes $\mathbf{w}$ from an unlabeled image $\mathbf{x}$; generate an average face $\mathbf{x}_{mean}$ from the attributes $\mathbf{w}$; select the attributes that we want to vary and change their values; generate the changed average face $\mathbf{x}'_{mean}$ from the changed attributes; and obtain a changed reconstruction image $\mathbf{x}'$ by $\mathbf{x}+\mathbf{x}'_{mean}-\mathbf{x}_{mean}$.
  • ...and 2 more figures
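
The sampling rule in the Figure 4 caption and the editing procedure in the Figure 5 caption can be sketched in a few lines of plain Python. This is an illustration under assumptions: `sample_latent`, `edit_image`, and `toy_mean_face` are hypothetical names; `toy_mean_face` is a made-up stand-in for the trained path from attributes to an average face; images are flat lists of pixel values; and ${\boldsymbol \zeta}$ is treated as a scalar variance, so the standard deviation of ${\boldsymbol \epsilon}$ is $\sqrt{\zeta}$.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def sample_latent(z_mean: Sequence[float], sigma: Sequence[float],
                  zeta: float, rng: random.Random) -> Vector:
    """Draw z = z_mean + sigma * eps, with eps ~ N(0, zeta) per dimension.

    Assumption: zeta is a scalar variance, so eps has std sqrt(zeta).
    """
    std = math.sqrt(zeta)
    return [m + s * rng.gauss(0.0, std) for m, s in zip(z_mean, sigma)]


def edit_image(x: Sequence[float], w: Sequence[float],
               w_changed: Sequence[float],
               mean_face: Callable[[Sequence[float]], Vector]) -> Vector:
    """Compute x' = x + x'_mean - x_mean, following the Figure 5 procedure."""
    x_mean = mean_face(w)                  # average face for the original attributes
    x_mean_changed = mean_face(w_changed)  # average face for the edited attributes
    return [xi + ci - mi for xi, ci, mi in zip(x, x_mean_changed, x_mean)]


def toy_mean_face(w: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the attributes-to-average-face path (4 'pixels')."""
    return [0.1 * w[i % len(w)] for i in range(4)]


rng = random.Random(0)
z = sample_latent([0.0, 0.0], [1.0, 1.0], zeta=0.6, rng=rng)

x = [0.2, 0.4, 0.6, 0.8]   # toy "image"
w = [1.0, -1.0]            # attributes inferred from x (made-up values)
w_changed = [2.0, -1.0]    # strengthen the first attribute, as in Figure 4
x_edited = edit_image(x, w, w_changed, toy_mean_face)
```

Only the difference between the two average faces is added to the input, so any detail of $\mathbf{x}$ that the attributes do not explain is carried over unchanged.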

Theorems & Definitions (1)

  • proof