Disentangling shared and private latent factors in multimodal Variational Autoencoders

Kaspar Märtens, Christopher Yau

TL;DR

This work investigates whether multimodal Variational Autoencoders can reliably disentangle shared from private latent factors, demonstrates limitations of existing models, and proposes a modification that makes them more robust to modality-specific variation.

Abstract

Generative models for multimodal data permit the identification of latent factors that may be associated with important determinants of observed data heterogeneity. Common or shared factors could be important for explaining variation across modalities, whereas other factors may be private and important only for explaining a single modality. Multimodal Variational Autoencoders, such as MVAE and MMVAE, are a natural choice for inferring those underlying latent factors and separating shared from private variation. In this work, we investigate their capability to reliably perform this disentanglement. In particular, we highlight a challenging problem setting where modality-specific variation dominates the shared signal. Taking a cross-modal prediction perspective, we demonstrate limitations of existing models and propose a modification that makes them more robust to modality-specific variation. Our findings are supported by experiments on synthetic as well as various real-world multi-omics data sets.

Paper Structure (26 sections, 8 equations, 11 figures, 5 tables)


Figures (11)

  • Figure 1: (A) Multimodal data with the underlying shared and private latent variables. We consider multimodal data (illustrated for tabular omics data, such as gene expression and methylation), where we distinguish between features that have been generated by shared latent factors and features that are driven by private modality-specific sources of variation. (B) Cross-modal prediction performance when increasing the number of private features. When gradually increasing the number of features that are driven by private latent factors (based on the example from Section \ref{sec:synthetic}), we observe that: 1) MMVAE and MoPoE consistently outperform MVAE, 2) the performance of all methods drops when increasing private features, but 3) our proposed modification MMVAE++ is significantly more robust than existing methods to high modality-specific variation.
  • Figure 2: Synthetic GP example: Cross-view prediction performance ($R^2$), separately for "shared" (red) and "private" (grey) feature sets.
  • Figure 3: CLL example: Cross-view $R^2$ for IGHV-related features for varying number of "other" genes, when predicting methylation from gene expression.
  • Figure 4: BRCA study: Cross-view prediction accuracy ($R^2$) separately for ER-related (red) and other (grey) feature sets, when predicting expression from methylation. Shown for (A) unsupervised models, and (B) supervised models that have access to the ER-status label.
  • Figure S1: Illustration of a partitioned latent space $\mathbf{z} = [\mathbf{z}^{\text{pr}_1}, \mathbf{z}^{\text{shared}}, \mathbf{z}^{\text{pr}_2}]$ in a multimodal VAE. The use of (a) shared and private latent variables allows for (b) cross-modal prediction via the shared component when one modality may be missing at train or test time.
  • ...and 6 more figures
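
The evaluation described in the captions above, cross-modal $R^2$ reported separately for "shared" and "private" feature sets, can be illustrated with a minimal sketch. It is not the paper's synthetic GP example and does not use any of the multimodal VAEs: it assumes a linear toy generative model with one shared and one private latent factor per modality, and it swaps in ordinary least squares as the cross-modal predictor. All function names, feature counts, and the noise level are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_two_modalities(n=500, n_shared=10, n_private=10, noise=0.1, seed=0):
    """Toy two-modality data: each modality has features driven by a shared
    latent factor and features driven by its own private latent factor.
    (Assumed linear model, for illustration only.)"""
    rng = np.random.default_rng(seed)
    z_shared = rng.normal(size=(n, 1))
    z_pr1 = rng.normal(size=(n, 1))
    z_pr2 = rng.normal(size=(n, 1))

    def block(z, n_features):
        # Loadings with magnitude in [1, 2] and random sign, plus Gaussian noise.
        sign = rng.choice([-1.0, 1.0], size=(1, n_features))
        w = sign * rng.uniform(1.0, 2.0, size=(1, n_features))
        return z @ w + noise * rng.normal(size=(n, n_features))

    x1 = np.hstack([block(z_shared, n_shared), block(z_pr1, n_private)])
    x2 = np.hstack([block(z_shared, n_shared), block(z_pr2, n_private)])
    # Boolean mask marking which columns are driven by the shared factor.
    is_shared = np.array([True] * n_shared + [False] * n_private)
    return x1, x2, is_shared


def r2_per_feature(y_true, y_pred):
    """Coefficient of determination, computed column by column."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


def cross_modal_r2(x1, x2, is_shared, n_train=400):
    """Predict modality 2 from modality 1 with least squares (a stand-in for
    a multimodal VAE) and average held-out R^2 within each feature set."""
    ones_train = np.ones((n_train, 1))
    ones_test = np.ones((len(x1) - n_train, 1))
    coef, *_ = np.linalg.lstsq(
        np.hstack([x1[:n_train], ones_train]), x2[:n_train], rcond=None
    )
    pred = np.hstack([x1[n_train:], ones_test]) @ coef
    r2 = r2_per_feature(x2[n_train:], pred)
    return r2[is_shared].mean(), r2[~is_shared].mean()


x1, x2, is_shared = simulate_two_modalities()
r2_shared, r2_private = cross_modal_r2(x1, x2, is_shared)
print(f"shared R^2: {r2_shared:.3f}, private R^2: {r2_private:.3f}")
```

In this toy setting only the shared features of the second modality are predictable from the first, which is why the captions report the two feature sets separately: averaging over all features would mix a predictable signal with private variation that no cross-modal predictor can recover.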