A Markov Random Field Multi-Modal Variational AutoEncoder

Fouad Oubari, Mohamed El Baha, Raphael Meunier, Rodrigue Décatoire, Mathilde Mougeot

TL;DR

This work models complex intermodal dependencies in multimodal data by incorporating Markov Random Fields (MRFs) into both the prior and the posterior of a multimodal variational autoencoder (MVAE). It introduces a family of MRF-based VAEs: a Gaussian MRF MVAE, an ALMRF MVAE for heavy-tailed data, and an NN-MRF MVAE with neural-network potentials, along with unified ELBO formulations and differentiable inference schemes. Empirically, the models perform competitively on PolyMNIST and achieve superior intermodal coherence on a synthetic copula dataset, indicating improved fidelity in joint generation across modalities and in dependency modeling. The framework enables a more faithful and tractable capture of complex cross-modal relationships, with potential benefits for explainability and downstream applications.

Abstract

Recent advancements in multimodal Variational AutoEncoders (VAEs) have highlighted their potential for modeling complex data from multiple modalities. However, many existing approaches use relatively straightforward aggregating schemes that may not fully capture the complex dynamics present between different modalities. This work introduces a novel multimodal VAE that incorporates a Markov Random Field (MRF) into both the prior and posterior distributions. This integration aims to capture complex intermodal interactions more effectively. Unlike previous models, our approach is specifically designed to model and leverage the intricacies of these relationships, enabling a more faithful representation of multimodal data. Our experiments demonstrate that our model performs competitively on the standard PolyMNIST dataset and shows superior performance in managing complex intermodal dependencies in a specially designed synthetic dataset, intended to test intricate relationships.

Paper Structure

This paper contains 62 sections, 3 theorems, 52 equations, 9 figures, 4 tables.

Key Result

Proposition 1

Given a random vector $\mathbf{z} = (z_1, \dots, z_n) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu} = (\mu_1, \dots, \mu_n)$ with each $\mu_i$ of dimension $d$, and $\boldsymbol{\Sigma}$ is a block matrix with blocks $\Sigma_{ij}$ of dimension $d \times d$ represe[…] where $\hat{\mu}_i$ and $\hat{\Sigma}_{ii}$ are computed as: […]
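
The statement above is cut off before its formulas. For orientation only, and assuming the proposition concerns the conditional distribution of one block $z_i$ given the remaining blocks (an assumption, not confirmed by this excerpt), the standard Gaussian conditioning identity gives $\hat{\mu}_i = \mu_i + \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}(z_{-i} - \mu_{-i})$ and $\hat{\Sigma}_{ii} = \Sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}$. The NumPy sketch below implements that textbook identity; the function name `condition_block` and the random test inputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def condition_block(mu, Sigma, z, i, d):
    """Mean and covariance of block i given all other blocks.

    Standard Gaussian conditioning (assumed here, not quoted from the paper):
      mu_hat    = mu_i + S_ir S_rr^{-1} (z_r - mu_r)
      Sigma_hat = S_ii - S_ir S_rr^{-1} S_ri
    where r indexes every coordinate outside block i.
    """
    total = mu.shape[0]
    idx_i = np.arange(i * d, (i + 1) * d)
    idx_r = np.setdiff1d(np.arange(total), idx_i)

    S_ii = Sigma[np.ix_(idx_i, idx_i)]
    S_ir = Sigma[np.ix_(idx_i, idx_r)]
    S_rr = Sigma[np.ix_(idx_r, idx_r)]

    # gain = S_ir S_rr^{-1}, computed with a linear solve instead of an inverse.
    gain = np.linalg.solve(S_rr, S_ir.T).T
    mu_hat = mu[idx_i] + gain @ (z[idx_r] - mu[idx_r])
    Sigma_hat = S_ii - gain @ S_ir.T
    return mu_hat, Sigma_hat

# Hypothetical inputs: n = 3 blocks of dimension d = 2 with a random SPD covariance.
rng = np.random.default_rng(0)
n, d = 3, 2
A = rng.normal(size=(n * d, n * d))
Sigma = A @ A.T + n * d * np.eye(n * d)
mu = rng.normal(size=n * d)
z = rng.normal(size=n * d)

mu_hat, Sigma_hat = condition_block(mu, Sigma, z, i=1, d=d)
```

In terms of the precision matrix $\Lambda = \Sigma^{-1}$, the same conditional covariance equals $\Lambda_{ii}^{-1}$, which is the form usually associated with Gaussian Markov Random Fields.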

Figures (9)

  • Figure 1: The MRF MVAE architecture. Each encoder produces a modality-specific mean $\mu_i$ and a diagonal block $L_{i,i}$. These blocks form the diagonal blocks of $L$, the lower-triangular matrix from the Cholesky decomposition of the covariance matrix $\Sigma = LL^{\top}$. The joint posterior distribution is characterized by the concatenated mean vector $\mu = (\mu_1, \dots, \mu_n)$ and the covariance matrix $\Sigma$, with the off-diagonal elements of $L$ generated by a global encoder.
  • Figure 2: Illustrative comparisons of conditional sample generation on the PolyMNIST dataset. The top row shows the initial samples from one modality, followed by four conditionally generated samples for each remaining modality.
  • Figure 3: Qualitative results for the unconditional generations on the copula dataset. Each subplot visualizes joint distributions for each pair of coordinates $(X_i^1, X_j^1)$ and $(X_i^2, X_j^2)$ across the four two-dimensional modalities $(X_1, X_2, X_3, X_4)$. The true distributions are depicted in orange and the generated ones in blue.
  • Figure 4: Qualitative analysis of unconditional generations using the copula dataset. Each subplot displays the marginal distributions for each coordinate: $(X_i^1)$ on the left and $(X_i^2)$ on the right, across four two-dimensional modalities $(X_1, X_2, X_3, X_4)$. True distributions are depicted in orange and generated distributions in blue.
  • Figure 5: Qualitative results of unconditional generations from the copula dataset across three training iterations of the MVAE. Each subplot shows joint distributions for pairs of coordinates $(X_i^1, X_j^1)$ and $(X_i^2, X_j^2)$ across the four two-dimensional modalities $(X_1, X_2, X_3, X_4)$. The true distributions are shown in orange, and the MVAE-generated distributions are in blue.
  • ...and 4 more figures
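
The covariance parameterization described in the Figure 1 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoder outputs are replaced by random stand-ins, each diagonal block $L_{i,i}$ is taken to be a $d \times d$ lower-triangular matrix with a positive diagonal (enforced here with an exponential, which is an assumption), and the names `assemble_cholesky` and `sample_joint` are hypothetical.

```python
import numpy as np

def assemble_cholesky(diag_blocks, off_blocks):
    """Build the block lower-triangular factor L from its blocks.

    diag_blocks: list of n (d, d) arrays, one per modality encoder.
    off_blocks:  dict {(i, j): (d, d) array} with i > j, standing in for
                 the output of a global encoder.
    """
    n = len(diag_blocks)
    d = diag_blocks[0].shape[0]
    L = np.zeros((n * d, n * d))
    for i, block in enumerate(diag_blocks):
        # Keep only the lower triangle so that L stays lower triangular.
        L[i * d:(i + 1) * d, i * d:(i + 1) * d] = np.tril(block)
    for (i, j), block in off_blocks.items():
        if i <= j:
            raise ValueError("off-diagonal blocks must lie below the diagonal")
        L[i * d:(i + 1) * d, j * d:(j + 1) * d] = block
    return L

def sample_joint(mu, L, rng):
    """Reparameterized draw z = mu + L @ eps, so that Cov(z) = L @ L.T."""
    eps = rng.standard_normal(mu.shape[0])
    return mu + L @ eps

# Random stand-ins for encoder outputs (n = 3 modalities, latent dimension d = 2).
rng = np.random.default_rng(0)
n, d = 3, 2
diag_blocks = []
for _ in range(n):
    raw = rng.normal(size=(d, d))
    # Strictly lower part is free; the diagonal is made positive via exp (assumption).
    diag_blocks.append(np.tril(raw, k=-1) + np.diag(np.exp(np.diag(raw))))
off_blocks = {(i, j): 0.1 * rng.normal(size=(d, d))
              for i in range(n) for j in range(i)}
mu = rng.normal(size=n * d)

L = assemble_cholesky(diag_blocks, off_blocks)
Sigma = L @ L.T          # joint covariance, positive definite by construction
z = sample_joint(mu, L, rng)
```

Parameterizing $L$ rather than $\Sigma$ directly keeps the covariance positive definite for any block values, as long as the diagonal of $L$ is positive, and makes sampling a single matrix-vector product.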

Theorems & Definitions (6)

  • Proposition 1
  • Lemma 1
  • Corollary 1: Generalization to $n$-vector Partitions
  • proof
  • proof
  • proof