A Markov Random Field Multi-Modal Variational AutoEncoder

Fouad Oubari; Mohamed El Baha; Raphael Meunier; Rodrigue Décatoire; Mathilde Mougeot


TL;DR

This work tackles the modeling of complex intermodal dependencies in multimodal data by incorporating Markov Random Fields into both the prior and the posterior of a multimodal variational autoencoder. It introduces a family of MRF-based VAEs, including a Gaussian MRF MVAE, an ALMRF MVAE for heavy-tailed data, and an NN-MRF MVAE with neural-network potentials, along with unified ELBO formulations and differentiable inference schemes. Empirical results show competitive performance on PolyMNIST and superior intermodal coherence on a synthetic copula dataset, highlighting improved fidelity in joint generation across modalities and in dependency modeling. The proposed framework advances multimodal generative modeling by capturing complex cross-modal relationships more faithfully and tractably, with potential benefits for explainability and downstream applications.
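The ELBO mentioned above contains a KL term between the joint posterior and the prior. A Gaussian MRF is, by definition, a multivariate Gaussian. If both the joint posterior and the GMRF prior are therefore taken to be full-covariance Gaussians over the concatenated latent vector, that KL term has a closed form. The sketch below assumes exactly this Gaussian case and is not the paper's implementation. The function name `gaussian_kl` is hypothetical, and the ALMRF and NN-MRF variants are not covered.

```python
import numpy as np


def gaussian_kl(mu_q, Sigma_q, mu_p, Sigma_p):
    """KL( N(mu_q, Sigma_q) || N(mu_p, Sigma_p) ) for full-covariance Gaussians.

    Assumption: q is the joint posterior and p the Gaussian MRF prior, both
    defined over the same concatenated latent vector of length k.
    """
    k = mu_q.shape[0]
    diff = mu_p - mu_q
    # tr(Sigma_p^{-1} Sigma_q), computed with a linear solve instead of an explicit inverse
    trace_term = np.trace(np.linalg.solve(Sigma_p, Sigma_q))
    # (mu_p - mu_q)^T Sigma_p^{-1} (mu_p - mu_q)
    mahalanobis = diff @ np.linalg.solve(Sigma_p, diff)
    # log-determinants via slogdet for numerical stability
    _, logdet_p = np.linalg.slogdet(Sigma_p)
    _, logdet_q = np.linalg.slogdet(Sigma_q)
    return 0.5 * (trace_term + mahalanobis - k + logdet_p - logdet_q)
```

In a training loop this value would be subtracted from a Monte Carlo estimate of the reconstruction log-likelihood to form the ELBO. Off-diagonal entries of both covariances are what allow cross-modal dependencies to enter this term.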

Abstract

Recent advancements in multimodal Variational AutoEncoders (VAEs) have highlighted their potential for modeling complex data from multiple modalities. However, many existing approaches rely on relatively simple aggregation schemes that may not fully capture the complex interactions between modalities. This work introduces a novel multimodal VAE that incorporates a Markov Random Field (MRF) into both the prior and posterior distributions, with the aim of capturing complex intermodal interactions more effectively. Unlike previous models, our approach is specifically designed to model and leverage the intricacies of these relationships, enabling a more faithful representation of multimodal data. Our experiments show that the model performs competitively on the standard PolyMNIST dataset and achieves superior performance in capturing complex intermodal dependencies on a synthetic dataset designed specifically to test such intricate relationships.


Paper Structure (62 sections, 3 theorems, 52 equations, 9 figures, 4 tables)

Introduction
Development of the MRF MVAE
The GMRF MVAE
Extended Variants
Methodological Framework
Empirical Validation
Related Work
Multimodal VAEs
Markov Random Fields
Markov Random Fields in Machine Learning
Method
Variational Autoencoders
Markov Random Fields
MRF MVAE
Gaussian MRF MVAE
...and 47 more sections

Key Result

Proposition 1

Given a random vector $\mathbf{z} = (z_1, \dots, z_n) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu} = (\mu_1, \dots, \mu_n)$ with each $\mu_i$ of dimension $d$, and $\boldsymbol{\Sigma}$ is a block matrix with blocks $\Sigma_{ij}$ of dimension $d \times d$ representing the covariance between $z_i$ and $z_j$ … where $\hat{\mu}_i$ and $\hat{\Sigma}_{ii}$ are computed as:
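Only the opening of Proposition 1 survives here. Its hatted quantities $\hat{\mu}_i$ and $\hat{\Sigma}_{ii}$ suggest the standard Gaussian conditioning result, which this sketch assumes for illustration only (it is not the paper's exact statement): $z_i \mid z_{-i} \sim \mathcal{N}(\hat{\mu}_i, \hat{\Sigma}_{ii})$ with $\hat{\mu}_i = \mu_i + \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}(z_{-i} - \mu_{-i})$ and $\hat{\Sigma}_{ii} = \Sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}$. The function name `block_conditional` is hypothetical.

```python
import numpy as np


def block_conditional(mu, Sigma, z, i, d):
    """Mean and covariance of block z_i given all other blocks z_{-i}.

    Assumes mu and z have shape (n*d,), Sigma has shape (n*d, n*d),
    and block i occupies indices [i*d, (i+1)*d).
    """
    idx_i = np.arange(i * d, (i + 1) * d)
    idx_rest = np.setdiff1d(np.arange(mu.shape[0]), idx_i)
    S_ii = Sigma[np.ix_(idx_i, idx_i)]
    S_ir = Sigma[np.ix_(idx_i, idx_rest)]
    S_rr = Sigma[np.ix_(idx_rest, idx_rest)]
    # Solve S_rr x = (z_rest - mu_rest) rather than forming S_rr^{-1} explicitly
    delta = np.linalg.solve(S_rr, z[idx_rest] - mu[idx_rest])
    mu_hat = mu[idx_i] + S_ir @ delta
    # Schur complement of S_rr in Sigma
    Sigma_hat = S_ii - S_ir @ np.linalg.solve(S_rr, S_ir.T)
    return mu_hat, Sigma_hat
```

For a two-block example with $d = 1$, $\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$, $\boldsymbol{\mu} = 0$ and $z_2 = 2$, conditioning the first block gives $\hat{\mu}_1 = 1$ and $\hat{\Sigma}_{11} = 0.75$. Equivalently, $\hat{\Sigma}_{ii}$ is the inverse of the $i$-th diagonal block of the precision matrix $\boldsymbol{\Sigma}^{-1}$, which is the quantity an MRF parameterizes directly.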

Figures (9)

Figure 1: The MRF MVAE architecture. Each encoder produces a modality-specific mean $\mu_i$ and a diagonal block $L_{i,i}$. These blocks constitute the diagonal blocks of $L$, the lower-triangular matrix from the Cholesky decomposition of the covariance matrix $\Sigma = LL^{\top}$. The joint posterior distribution is characterized by the concatenated mean vector $\mu = (\mu_1, \dots, \mu_n)$ and the covariance matrix $\Sigma$; the off-diagonal elements of $L$ are generated by a global encoder.
Figure 2: Illustrative comparisons of conditional sample generation on the PolyMNIST dataset. The top row shows the conditioning samples from one modality, followed by four conditionally generated samples for each remaining modality.
Figure 3: Qualitative results of unconditional generation on the copula dataset. Each subplot visualizes the joint distribution of a pair of coordinates, $(X_i^1, X_j^1)$ or $(X_i^2, X_j^2)$, across the four two-dimensional modalities $(X_1, X_2, X_3, X_4)$. The true distributions are depicted in orange and the generated ones in blue.
Figure 4: Qualitative analysis of unconditional generations using the copula dataset. Each subplot displays the marginal distributions for each coordinate: $(X_i^1)$ on the left and $(X_i^2)$ on the right, across four two-dimensional modalities $(X_1, X_2, X_3, X_4)$. True distributions are depicted in orange and generated distributions in blue.
Figure 5: Qualitative results of unconditional generations from the copula dataset across three training iterations of the MVAE. Each subplot shows joint distributions for pairs of coordinates $(X_i^1, X_j^1)$ and $(X_i^2, X_j^2)$ across the four two-dimensional modalities $(X_1, X_2, X_3, X_4)$. The true distributions are shown in orange, and the MVAE-generated distributions are in blue.
...and 4 more figures
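Figure 1 describes how the joint posterior covariance is assembled from per-modality and global encoder outputs. The sketch below illustrates that assembly under stated assumptions. Each $L_{i,i}$ is taken to be lower triangular with a softplus-positive diagonal (a purely diagonal $L_{i,i}$ is a special case). The global encoder's output is taken to be a set of $d \times d$ blocks $L_{i,j}$ for $i > j$. Sampling uses the reparameterization $z = \mu + L\varepsilon$. The function names and the softplus parameterization are hypothetical, and the encoders are replaced by random arrays.

```python
import numpy as np


def to_cholesky_block(raw):
    """Hypothetical map from an unconstrained (d, d) encoder output to a Cholesky block:
    keep the strictly-lower entries and make the diagonal positive with softplus."""
    strict_lower = np.tril(raw, k=-1)
    positive_diag = np.logaddexp(0.0, np.diag(raw))  # softplus(x) = log(1 + e^x)
    return strict_lower + np.diag(positive_diag)


def assemble_cholesky(diag_blocks, lower_blocks):
    """Place per-modality blocks L_ii on the diagonal and global-encoder blocks
    L_ij (i > j) below it, giving a lower-triangular (n*d, n*d) matrix L."""
    n = len(diag_blocks)
    d = diag_blocks[0].shape[0]
    L = np.zeros((n * d, n * d))
    for i, L_ii in enumerate(diag_blocks):
        L[i * d:(i + 1) * d, i * d:(i + 1) * d] = L_ii
    for (i, j), L_ij in lower_blocks.items():
        if not i > j:
            raise ValueError("off-diagonal blocks must lie strictly below the diagonal")
        L[i * d:(i + 1) * d, j * d:(j + 1) * d] = L_ij
    return L


def sample_joint_posterior(mu_blocks, L, rng, num_samples=1):
    """Reparameterized draws z = mu + L @ eps with eps ~ N(0, I), so Cov(z) = L L^T.
    Returns an array of shape (num_samples, n*d)."""
    mu = np.concatenate(mu_blocks)
    eps = rng.standard_normal((num_samples, mu.shape[0]))
    return mu + eps @ L.T
```

Because every diagonal entry of $L$ is strictly positive and $L$ is lower triangular, $\Sigma = LL^{\top}$ is always symmetric positive definite. The network can therefore output unconstrained values while still producing a valid covariance with nonzero cross-modal blocks.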

Theorems & Definitions (6)

Proposition 1
Lemma 1
Corollary 1: Generalization to $n$-vector Partitions
Proof
Proof
Proof
