Causal Structure and Representation Learning with Biomedical Applications

Caroline Uhler, Jiaqi Zhang

TL;DR

The paper addresses how to combine representation learning with causal inference in biomedical settings, emphasizing multi-modal observational and perturbational data. It surveys causal discovery algorithms (PC, GES, GSP) and their use with interventional data, highlighting identifiability limits under faithfulness and finite samples. It then develops causal representation learning (CRL) frameworks for single-modality, interventional, and multi-modal data, providing identifiability results and practical algorithms (e.g., leaf detection via Jacobians and constrained optimization) that recover latent causal variables and their relations up to equivalence. The work applies these ideas to gene regulatory networks and Perturb-seq data, and advocates causal experimental design to select informative perturbations and modalities efficiently, with broad implications for accelerating biomedical discovery.

Abstract

Massive data collection holds the promise of a better understanding of complex phenomena and, ultimately, better decisions. Representation learning has become a key driver of deep learning applications, as it allows learning latent spaces that capture important properties of the data without requiring any supervised annotations. Although representation learning has been hugely successful in predictive tasks, it can fail miserably in causal tasks including predicting the effect of a perturbation/intervention. This calls for a marriage between representation learning and causal inference. An exciting opportunity in this regard stems from the growing availability of multi-modal data (observational and perturbational, imaging-based and sequencing-based, at the single-cell level, tissue-level, and organism-level). We outline a statistical and computational framework for causal structure and representation learning motivated by fundamental biomedical questions: how to effectively use observational and perturbational data to perform causal discovery on observed causal variables; how to use multi-modal views of the system to learn causal variables; and how to design optimal perturbations.

Paper Structure

This paper contains 11 sections, 13 theorems, 12 equations, 7 figures.

Key Result

Lemma 2

If a distribution $\mathbb{P}$ is Markov with respect to a DAG $\mathcal{G}$, then d-separation implies conditional independence, i.e., $i \mathrel{\perp\!\!\!\perp} j \mid S$ in $\mathcal{G}$ $\;\Longrightarrow\;$ $X_i \mathrel{\perp\!\!\!\perp} X_j \mid X_S$ in $\mathbb{P}$.
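
As a small numerical illustration of Lemma 2 (a sketch under stated assumptions, not code from the paper), consider the chain $1 \to 2 \to 3$, in which node 1 is d-separated from node 3 given node 2. The linear Gaussian model, the edge weights 0.8 and -1.5, the sample size, and the helper `partial_corr` are all hypothetical choices made for this example. For jointly Gaussian variables, a vanishing partial correlation is equivalent to conditional independence, so the sample partial correlation of $X_1$ and $X_3$ given $X_2$ should be close to zero even though the marginal correlation is not.

```python
import numpy as np

def partial_corr(x, y, z):
    """Sample partial correlation of x and y given z (all 1-D arrays)."""
    # Regress x and y on z (with an intercept) and correlate the residuals.
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
n = 50_000  # assumed sample size, large enough to make sampling noise small

# Linear Gaussian structural equations for the chain 1 -> 2 -> 3.
# The edge weights are arbitrary illustrative values.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = -1.5 * x2 + rng.normal(size=n)

# 1 and 3 are d-connected given the empty set, so they are correlated.
marginal = float(np.corrcoef(x1, x3)[0, 1])
# 1 and 3 are d-separated given {2}, so the partial correlation vanishes
# in the population and is near zero in the sample.
conditional = partial_corr(x1, x3, x2)

print(f"corr(X1, X3)      = {marginal:.3f}")
print(f"corr(X1, X3 | X2) = {conditional:.3f}")
```

The lemma only gives the direction from d-separation to conditional independence; the converse requires an additional assumption such as faithfulness, which this sketch does not test.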

Figures (7)

  • Figure 1: Illustrative examples and their respective causal graphs.
  • Figure 2: Illustrative examples of causal representation learning.
  • Figure 3: Illustrative examples of multi-modal causal representation learning.
  • Figure 4: Surfaces in $\mathbb{R}^3$ that correspond to unfaithful distributions for fully connected 3-node linear Gaussian causal models.
  • Figure 5: V-structure and Meek rules.
  • ...and 2 more figures

Theorems & Definitions (18)

  • Definition 1
  • Lemma 2
  • Definition 3
  • Definition 4
  • Example 5
  • Theorem 6
  • Theorem 7
  • Lemma 8: CI test for v-structure
  • Lemma 9: CI test for Meek Rule 1
  • Theorem 10
  • ...and 8 more