Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference

Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo

TL;DR

Experiments show that multimodal iterative amortized inference improves representations inferred from unimodal inputs, evidenced by higher linear classification accuracy and competitive cosine similarity, and enhances cross-modal generation, indicated by lower FID scores.

Abstract

Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities. A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number (2^M) of inference networks for all possible modality combinations. Mixture-based models simplify this by requiring only as many inference models as there are modalities, aggregating unimodal inferences. However, they suffer from information loss when modalities are missing. Alignment-based VAEs address this by aligning unimodal inference models with a multimodal model through minimizing the Kullback-Leibler (KL) divergence but face issues due to amortization gaps, which compromise inference accuracy. To tackle these problems, we introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework. This method overcomes information loss from missing modalities and minimizes the amortization gap by iteratively refining the multimodal inference using all available modalities. By aligning unimodal inference to this refined multimodal posterior, we achieve unimodal inferences that effectively incorporate multimodal information while requiring only unimodal inputs during inference. Experiments on benchmark datasets show that our approach improves inference performance, evidenced by higher linear classification accuracy and competitive cosine similarity, and enhances cross-modal generation, indicated by lower FID scores. This demonstrates that our method enhances inferred representations from unimodal inputs.
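
The two building blocks named in the abstract, aggregating unimodal posteriors and aligning posteriors via a KL term, can be illustrated with scalar Gaussians. This is a minimal sketch, not the paper's implementation: the function names, the example means and variances, the choice of a product-of-experts rule for aggregation, and the direction of the KL divergence are all assumptions made for illustration.

```python
import math

def product_of_experts(mus, variances):
    """Fuse 1-D Gaussian experts N(mu_i, var_i) by precision weighting."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mu = sum(m * p for m, p in zip(mus, precisions)) / total_precision
    return mu, 1.0 / total_precision

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)

# Hypothetical unimodal posteriors q(z|x_0) = N(0, 1) and q(z|x_1) = N(2, 1).
mu_joint, var_joint = product_of_experts([0.0, 2.0], [1.0, 1.0])

# Gap between the fused posterior and the first unimodal posterior
# (KL direction chosen for illustration only).
alignment_gap = kl_gaussian(mu_joint, var_joint, 0.0, 1.0)
```

With M modalities, this aggregation needs only M unimodal posteriors instead of one inference network for each of the 2^M subsets. An alignment objective would then drive a term like `alignment_gap` toward zero during training.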

Paper Structure

This paper contains 17 sections, 22 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Proposed method (number of modalities $M = 2$). Red lines represent gradient propagation through backpropagation using the multimodal ELBO, while blue lines indicate inference updates using gradients of the mean ($\boldsymbol{\mu}$) and variance ($\boldsymbol{\sigma}$). Together, these form the process of multimodal iterative amortized inference.
  • Figure 2: Qualitative results of cross-modal generation on the MNIST-SVHN-Text dataset when applying multimodal iterative amortized inference to $q_{\phi_2}(\mathbf{z}|\mathbf{x}_2)$ (input modality $\mathbf{x}_2$ is Text, generated modality $\mathbf{x}_0$ is MNIST (left), and generated modality $\mathbf{x}_1$ is SVHN (right)). By increasing the number of iterations $T$, information from missing modalities is recovered, improving the performance of cross-modal generation.
  • Figure 3: Qualitative results of cross-modal generation on the CUB dataset when applying multimodal iterative amortized inference to $q_{\phi_1}(\mathbf{z}|\mathbf{x}_1)$ (input modality $\mathbf{x}_1$ is Text (shown in a blue box), generated modality $\mathbf{x}_0$ is Image). By increasing the number of iterations $T$, information from missing modalities is recovered, improving the performance of cross-modal generation.
  • Figure 4: Improvement in multimodal ELBO when applying multimodal iterative amortized inference to $q_{\phi_0}(\mathbf{z}|\mathbf{x}_0)$ (input modality $\mathbf{x}_0$ is MNIST, left), to $q_{\phi_1}(\mathbf{z}|\mathbf{x}_1)$ (input modality $\mathbf{x}_1$ is SVHN, middle) and to $q_{\phi_2}(\mathbf{z}|\mathbf{x}_2)$ (input modality $\mathbf{x}_2$ is Text, right) on the MNIST-SVHN-Text dataset. The dotted line represents the ELBO of the alignment source model (PoE).
  • Figure 5: Improvement in multimodal ELBO when applying multimodal iterative amortized inference to $q_{\phi_0}(\mathbf{z}|\mathbf{x}_0)$ (input modality $\mathbf{x}_0$ is Image, left) and to $q_{\phi_1}(\mathbf{z}|\mathbf{x}_1)$ (input modality $\mathbf{x}_1$ is Text, right) on the CUB dataset. The dotted line represents the ELBO of the alignment source model (PoE).
  • ...and 7 more figures
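
The refinement loop that the captions describe (update the posterior mean and variance using ELBO gradients for $T$ iterations, starting from a unimodal estimate) can be sketched on a toy problem. Everything below is an assumption made for illustration: the linear-Gaussian model $z \sim \mathcal{N}(0, 1)$, $x_m \sim \mathcal{N}(w_m z, 1)$, the function names, the data values, and in particular the use of plain gradient ascent in place of a learned refinement network.

```python
import math

def elbo(mu, log_var, xs, ws):
    """Multimodal ELBO of q = N(mu, exp(log_var)) for the toy model
    z ~ N(0, 1), x_m ~ N(w_m * z, 1)."""
    var = math.exp(log_var)
    recon = sum(-0.5 * math.log(2.0 * math.pi)
                - 0.5 * ((x - w * mu) ** 2 + w * w * var)
                for x, w in zip(xs, ws))
    kl = 0.5 * (mu * mu + var - 1.0 - log_var)  # KL(q || N(0, 1))
    return recon - kl

def elbo_grads(mu, log_var, xs, ws):
    """Analytic ELBO gradients w.r.t. the mean and the log-variance."""
    var = math.exp(log_var)
    g_mu = sum(w * (x - w * mu) for x, w in zip(xs, ws)) - mu
    g_log_var = 0.5 * (1.0 - var * (1.0 + sum(w * w for w in ws)))
    return g_mu, g_log_var

def refine(mu, log_var, xs, ws, steps=100, lr=0.3):
    """Gradient-ascent refinement (a stand-in for a learned update rule)."""
    trace = [elbo(mu, log_var, xs, ws)]
    for _ in range(steps):
        g_mu, g_log_var = elbo_grads(mu, log_var, xs, ws)
        mu, log_var = mu + lr * g_mu, log_var + lr * g_log_var
        trace.append(elbo(mu, log_var, xs, ws))
    return mu, log_var, trace

# Hypothetical observations and weights for two modalities.
xs, ws = [1.0, 2.0], [1.0, 1.0]

# Start from the exact posterior given x_0 alone (a unimodal estimate).
mu_0 = ws[0] * xs[0] / (1.0 + ws[0] ** 2)
log_var_0 = math.log(1.0 / (1.0 + ws[0] ** 2))

mu_T, log_var_T, trace = refine(mu_0, log_var_0, xs, ws)
```

In this toy model the exact multimodal posterior is $\mathcal{N}(1, 1/3)$, and the refined estimate moves from the unimodal starting point toward it while the multimodal ELBO rises at each step.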