Propensity Score Alignment of Unpaired Multimodal Data

Johnny Xi, Jana Osea, Zuheng Xu, Jason Hartford

TL;DR

We draw an analogy between potential outcomes in causal inference and potential views in multimodal observations, which allows us to use Rubin's framework to estimate a common space in which to match samples.

Abstract

Multimodal representation learning techniques typically rely on paired samples to learn common representations, but paired samples are challenging to collect in fields such as biology, where measurement devices often destroy the samples. This paper presents an approach to aligning unpaired samples across disparate modalities in multimodal representation learning. We draw an analogy between potential outcomes in causal inference and potential views in multimodal observations, which allows us to use Rubin's framework to estimate a common space in which to match samples. Our approach assumes we collect samples that are experimentally perturbed by treatments, and uses these treatments to estimate a propensity score from each modality. The propensity score encapsulates all shared information between a latent state and treatment and can be used to define a distance between samples. We experiment with two alignment techniques that leverage this distance -- shared nearest neighbours (SNN) and optimal transport (OT) matching -- and find that OT matching results in significant improvements over state-of-the-art alignment approaches in both a synthetic multimodal setting and in real-world data from the NeurIPS Multimodal Single-Cell Integration Challenge.
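The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that a classifier per modality has already produced a propensity vector (estimated treatment probabilities) for every sample. All sample names and numbers below are hypothetical, squared Euclidean distance between propensity vectors is an assumed choice of distance, and a brute-force search over permutations stands in for an OT solver. With uniform weights and equal group sizes, exact OT reduces to such a one-to-one assignment.

```python
from itertools import permutations
from math import dist


def match_group(scores_a, scores_b):
    """Exact one-to-one matching of two equal-size groups of propensity vectors.

    Brute force over permutations, so only viable for tiny groups.
    Returns (list of (i, j) index pairs, total squared-distance cost).
    """
    n = len(scores_a)
    assert n == len(scores_b), "this sketch assumes equal group sizes"
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(dist(scores_a[i], scores_b[perm[i]]) ** 2 for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return [(i, best_perm[i]) for i in range(n)], best_cost


def align(mod_a, mod_b):
    """Match samples across modalities separately within each treatment.

    mod_a, mod_b: dict mapping treatment -> list of (sample_id, propensity_vector).
    Returns a list of (treatment, id_in_a, id_in_b) triples.
    """
    pairs = []
    for t in sorted(mod_a.keys() & mod_b.keys()):
        ids_a, sc_a = zip(*mod_a[t])
        ids_b, sc_b = zip(*mod_b[t])
        matched, _ = match_group(sc_a, sc_b)
        pairs += [(t, ids_a[i], ids_b[j]) for i, j in matched]
    return pairs


# Hypothetical propensity vectors over two treatments (not data from the paper).
mod_a = {
    0: [("img0", (0.90, 0.10)), ("img1", (0.60, 0.40))],
    1: [("img2", (0.20, 0.80)), ("img3", (0.45, 0.55))],
}
mod_b = {
    0: [("rna0", (0.62, 0.38)), ("rna1", (0.88, 0.12))],
    1: [("rna2", (0.40, 0.60)), ("rna3", (0.25, 0.75))],
}

pairs = align(mod_a, mod_b)
for triple in pairs:
    print(triple)
```

Restricting the search to samples that received the same treatment keeps each assignment problem small. For realistic group sizes, the brute-force search would be replaced by a dedicated assignment or OT solver.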

Paper Structure (35 sections, 3 theorems, 27 equations, 4 figures, 3 tables)

Key Result

Proposition 3.1

In the model described by eqn::dgp, further assume that $f^{(e)}$ is injective for $e = 1, 2$. Then the propensity score in either modality equals the propensity score given by $Z$, i.e., $\pi(X^{(1)}) = \pi(X^{(2)}) = \pi(Z)$ as random variables. This implies $I(t, Z \mid \pi(Z)) = I(t, Z \mid \pi(X^{(e)})) = 0$ for each $e = 1, 2$, where $I$ is the mutual information. Furthermore, any other function $b(Z)$ satisfying $I(t, Z \mid b(Z)) = 0$
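The equality $\pi(X^{(e)}) = \pi(Z)$ can be checked numerically on a small discrete example. The sketch below is a toy model and not the paper's data-generating process, which is not reproduced here. It assumes each observation is $x = f(z, u)$ with modality-specific noise $u$ independent of the treatment, and every probability table and observation map (`observe_1`, `observe_2`, `observe_lossy`) is hypothetical. Exact fractions avoid floating-point comparisons.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical toy model: t is uniform on {0, 1}, z depends on t,
# and the modality noise u is independent of both (an assumption).
p_t = {0: F(1, 2), 1: F(1, 2)}
p_z_given_t = {
    0: {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)},
    1: {0: F(1, 6), 1: F(1, 3), 2: F(1, 2)},
}
p_u = {0: F(1, 4), 1: F(3, 4)}


def propensity(observe):
    """Return {x: p(t = 1 | x)} for observations x = observe(z, u)."""
    joint = defaultdict(lambda: defaultdict(F))  # joint[x][t] = p(x, t)
    for t, pt in p_t.items():
        for z, pz in p_z_given_t[t].items():
            for u, pu in p_u.items():
                joint[observe(z, u)][t] += pt * pz * pu
    return {x: by_t[1] / sum(by_t.values()) for x, by_t in joint.items()}


def latent(z, u):
    return z  # observing z itself gives pi(Z)


def observe_1(z, u):
    return ("img", z, u)  # injective in (z, u)


def observe_2(z, u):
    return 10 * z + u  # a different injective encoding


def observe_lossy(z, u):
    return min(z, 1)  # not injective: merges z = 1 and z = 2


pi_z = propensity(latent)
pi_x1 = propensity(observe_1)
pi_x2 = propensity(observe_2)
pi_lossy = propensity(observe_lossy)

for z in p_z_given_t[0]:
    for u in p_u:
        # Both injective views recover the latent propensity score exactly.
        assert pi_x1[observe_1(z, u)] == pi_z[z]
        assert pi_x2[observe_2(z, u)] == pi_z[z]

print(pi_z)         # latent propensity scores
print(pi_lossy[1])  # score after merging z = 1 and z = 2
```

In this toy model, the non-injective map assigns the merged observation a single score that matches neither $\pi(Z = 1)$ nor $\pi(Z = 2)$, which illustrates why the injectivity assumption matters.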

Figures (4)

  • Figure 1: Visualization of propensity score matching for two modalities (e.g., microscopy images and RNA expression data). We first train classifiers to estimate the propensity score for samples from each modality; the propensity score reveals the shared information $p(t|z_i)$, which allows us to re-pair the observed disconnected modalities. The matching procedure is then performed within each perturbation class based on the similarity between the propensity scores.
  • Figure 2: VAE and classifier validation metrics on the CITE-seq dataset. Notice that validation cross-entropy inversely tracks the ground truth matching metrics, and thus can be used as a proxy in practical settings where the ground truth is unknown. The same pattern does not hold for the VAE (yang2021multi), which we suspect is because reconstruction is largely irrelevant for matching.
  • Figure 3: OT matching allows for $t$ to have different effects on the modality-specific information, here $u_i^{(1)}$ and $u_i^{(2)}$, as long as they can be written as transformations that preserve the relative order within modalities. Exact OT in 1-d always matches according to the relative ordering, and thus exhibits the "no crossing" behaviour shown in the figure on the left. The figure on the right shows a case where we would fail to correctly match across modalities because of the crossing shown in orange.
  • Figure 4: Example pair of synthetic images with the same underlying $z$.
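The "no crossing" behaviour described in the Figure 3 caption can be reproduced in a few lines. The sketch below assumes uniform weights, equal sample counts, and a squared-distance cost, in which case exact 1-d OT pairs the k-th smallest value in one modality with the k-th smallest in the other. The values and the two transformations (one order-preserving, one order-reversing) are hypothetical.

```python
from itertools import permutations


def sorted_matching(a, b):
    """Pair the k-th smallest entry of a with the k-th smallest entry of b."""
    order_a = sorted(range(len(a)), key=a.__getitem__)
    order_b = sorted(range(len(b)), key=b.__getitem__)
    return sorted(zip(order_a, order_b))


def matching_cost(a, b, pairs):
    """Total squared-distance cost of a list of (i, j) index pairs."""
    return sum((a[i] - b[j]) ** 2 for i, j in pairs)


def brute_force_cost(a, b):
    """Minimum cost over all one-to-one matchings (tiny inputs only)."""
    n = len(a)
    return min(
        sum((a[i] - b[p[i]]) ** 2 for i in range(n))
        for p in permutations(range(n))
    )


# Hypothetical modality-specific values for three samples.
u = [0.1, 0.7, 0.4]
view_1 = u

# Modality 2 sees an order-preserving transformation, in shuffled order:
# position j of view_2 holds the sample with index shuffle[j].
shuffle = [2, 0, 1]
view_2 = [3 * u[k] + 1 for k in shuffle]
truth = sorted((k, j) for j, k in enumerate(shuffle))

pairs = sorted_matching(view_1, view_2)
print(pairs == truth)  # order preserved, so the true pairing is recovered

# The sorted pairing attains the optimal cost among all one-to-one matchings.
gap = matching_cost(view_1, view_2, pairs) - brute_force_cost(view_1, view_2)
print(abs(gap) < 1e-12)

# An order-reversing transformation creates crossings, and matching fails.
view_2_crossed = [-x for x in u]  # unshuffled, so the truth is the identity
crossed_pairs = sorted_matching(view_1, view_2_crossed)
print(crossed_pairs == [(0, 0), (1, 1), (2, 2)])
```

The last check shows the failure mode from the right-hand panel: once the transformation changes the relative order within a modality, the order-based matching no longer recovers the true pairs.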

Theorems & Definitions (7)

  • Proposition 3.1
  • Proposition 3.2
  • Definition A.1
  • Lemma B.1
  • proof
  • proof
  • proof