Simulation-based Benchmarking for Causal Structure Learning in Gene Perturbation Experiments

Luka Kovačević; Izzy Newsham; Sach Mukherjee; John Whittaker

TL;DR

This work tackles the challenge of context-specific benchmarking for causal structure learning (CSL) in gene perturbation settings by introducing CausalRegNet, a scalable multiplicative-effect structural causal model tailored to single-cell gene expression data. CausalRegNet generates both observational and interventional data using biologically motivated node-wise negative binomial distributions and a sigmoidal regulatory function. The model is calibrated to domain knowledge through closed-form conditions and includes an adjustment for nodes with low observational means. The authors validate the simulator against real Perturb-seq data, show that its output has low varsortability relative to existing simulators, and demonstrate its utility through interventional analyses, distributional fidelity measures (e.g., Wasserstein distance), and a CSL benchmarking study. Overall, CausalRegNet provides a practical, context-aware framework for evaluating CSL methods, generating training data for CSL models, and guiding experimental design in large-scale gene perturbation studies.
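
The ingredients named in the summary above (negative binomial nodes, a sigmoidal regulatory function, multiplicative effects, interventions) can be illustrated with a toy simulator. This is a minimal sketch under stated assumptions, not the CausalRegNet implementation: the sigmoid form, the calibration rule in `calibrated_shift`, the mean-normalised aggregation, the knockout semantics, and all parameter names (`alpha`, `gamma`, `theta`) are hypothetical choices made for illustration.

```python
import numpy as np


def calibrated_shift(alpha, gamma):
    """Shift that makes the effect equal 1 when the parent signal is 1 (needs alpha > 1)."""
    return 1.0 + np.log(alpha - 1.0) / gamma


def regulatory_effect(signal, alpha, gamma):
    """Illustrative sigmoid with values in (0, alpha); an assumed form, not the paper's."""
    shift = calibrated_shift(alpha, gamma)
    return alpha / (1.0 + np.exp(-gamma * (signal - shift)))


def sample_nb(rng, mean, theta):
    """Negative binomial draws with the given mean and inverse dispersion theta."""
    p = theta / (theta + mean)  # NumPy's (n, p) form has mean n * (1 - p) / p
    return rng.negative_binomial(theta, p)


def simulate(rng, n_samples, parents, base_mean, theta, alpha, gamma, knockout=None):
    """Sample nodes in topological order; `knockout` holds one node at zero."""
    base_mean = np.asarray(base_mean, dtype=float)
    X = np.zeros((n_samples, len(base_mean)))
    for j, pa in enumerate(parents):
        if j == knockout:
            continue  # hypothetical perfect knockout: expression stays at 0
        mean = np.full(n_samples, base_mean[j])
        if pa:
            # Mean-normalised linear aggregation of parent expression (assumed form).
            signal = np.mean(X[:, pa] / base_mean[pa], axis=1)
            # Multiplicative effect: parents rescale the node's baseline mean.
            mean = mean * regulatory_effect(signal, alpha, gamma)
        X[:, j] = sample_nb(rng, mean, theta)
    return X


rng = np.random.default_rng(0)
chain = [[], [0], [1]]  # X0 -> X1 -> X2, listed in topological order
means = [10.0, 10.0, 10.0]
obs = simulate(rng, 5000, chain, means, theta=5.0, alpha=2.0, gamma=2.0)
intv = simulate(rng, 5000, chain, means, theta=5.0, alpha=2.0, gamma=2.0, knockout=0)
print(obs.mean(axis=0), intv.mean(axis=0))
```

In this toy setup, knocking out `X0` lowers the mean of its descendants because the sigmoid is increasing in the parent signal. The shift is chosen so that a parent sitting at its baseline mean leaves the child's mean unchanged.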

Abstract

Causal structure learning (CSL) refers to the task of learning causal relationships from data. Advances in CSL now allow learning of causal graphs in diverse application domains, which has the potential to facilitate data-driven causal decision-making. Real-world CSL performance depends on a number of $\textit{context-specific}$ factors, including context-specific data distributions and non-linear dependencies, that are important in practical use-cases. However, our understanding of how to assess and select CSL methods in specific contexts remains limited. To address this gap, we present $\textit{CausalRegNet}$, a multiplicative effect structural causal model that allows for generating observational and interventional data incorporating context-specific properties, with a focus on the setting of gene perturbation experiments. Using real-world gene perturbation data, we show that CausalRegNet generates accurate distributions and scales far better than current simulation frameworks. We illustrate the use of CausalRegNet in assessing CSL methods in the context of interventional experiments in biology.

Paper Structure (36 sections, 18 equations, 8 figures, 4 tables)

Introduction
Related Work.
Contributions.
Background
Structural Causal Models
Interventions.
Additive Noise Models
Methodology
Desiderata
CausalRegNet
Node-wise Distribution
Regulatory Function
Model Specification
Example: Model Specification for Linear Aggregation with Mean-normalisation
Adjusted Regulatory Function
...and 21 more sections

Figures (8)

Figure 1: Example calibrated regulatory effect function. The red line indicates the cut-off; the regulatory effect never falls to the left of this line.
Figure 2: We examine the treatment effect of $X_0$ on $X_1$ as (a) the number of parents of $X_1$ increases and (b) the number of mediators between $X_0$ and $X_1$ increases (i.e. for $k=0,\ldots, 9$ under both (a) and (b)).
Figure 3: (a) Absolute ATE as the number of parents increases. (b) Mean observational expression of $X_1$ against the number of parents. (c) Absolute ATE, which falls for both simulators as the number of mediators increases.
Figure 4: (a) Simulation time for data from graphs of increasing size. Data is generated from the same causal graph structures across methods. Computational time for a single simulation with SERGIO with 5,000 nodes or more exceeds 24 hours and so is not included. Mean varsortability of data generated from (b) a causal chain and (c) a causal graph structure by each simulator with 95% confidence intervals.
Figure 5: (a) Comparison between real and simulated data for nodes fitted to genes in replogle2022mapping in a DAG with 3 nodes. (b) In green, the Wasserstein distance (WD) between simulated marginal distributions fitted to replogle2022mapping and the true distributions. In purple, the distribution labels are shuffled so that each simulated marginal is compared with a random empirical marginal, giving a baseline for comparison. (c, d) The distribution of interventional effects for each node with (c) $\alpha_j = 2$ and (d) $\alpha_j=5$.
...and 3 more figures
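
The shuffled-label baseline described for Figure 5(b) can be sketched as follows. This is an illustration on synthetic data, not the paper's pipeline: the `real` and `simulated` matrices are both drawn here from the same negative binomial marginals, and the names `wasserstein_1d`, `gene_means`, and `theta` are hypothetical. The helper uses the sorted-sample formula, which is exact only for equal-sized 1-D samples.

```python
import numpy as np


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    u, v = np.sort(u), np.sort(v)
    return float(np.mean(np.abs(u - v)))


rng = np.random.default_rng(1)
n_genes, n_cells, theta = 20, 2000, 5.0
gene_means = rng.uniform(1.0, 50.0, size=n_genes)  # synthetic, not fitted to real data
p = theta / (theta + gene_means)

# Two independent draws stand in for "real" and "simulated" expression matrices.
real = rng.negative_binomial(theta, p, size=(n_cells, n_genes))
simulated = rng.negative_binomial(theta, p, size=(n_cells, n_genes))

# Matched comparison: each simulated gene against its own real counterpart.
matched = np.array(
    [wasserstein_1d(simulated[:, g], real[:, g]) for g in range(n_genes)]
)

# Baseline: each simulated gene against a randomly assigned real gene.
shuffled_genes = rng.permutation(n_genes)
baseline = np.array(
    [wasserstein_1d(simulated[:, g], real[:, shuffled_genes[g]]) for g in range(n_genes)]
)
print(matched.mean(), baseline.mean())
```

A simulator that reproduces the marginals well should give matched distances clearly below the shuffled baseline. For samples of unequal size, `scipy.stats.wasserstein_distance` is a more general alternative.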

Theorems & Definitions (3)

Definition 2.1: Structural Causal Model; peters2017elements
Definition 2.2: Additive Noise Model; hoyer2008nonlinear
Definition 2.3: Varsortability; reisach2021beware
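
As a rough illustration of Definition 2.3, varsortability can be computed as the fraction of directed paths along which marginal variance increases from cause to effect, with ties counted as one half. The sketch below follows that commonly stated form of the definition in reisach2021beware. The function name, the tolerance `tol`, and the toy linear chain are assumptions for illustration, and the code assumes a graph with at least one edge and no zero-variance columns.

```python
import numpy as np


def varsortability(X, adjacency, tol=1e-9):
    """Fraction of directed paths whose end node has larger variance than the start node."""
    A = (np.asarray(adjacency) != 0).astype(int)  # A[i, j] = 1 means i -> j
    d = A.shape[0]
    var = np.var(X, axis=0)
    ratio = var[None, :] / var[:, None]  # ratio[i, j] = Var(X_j) / Var(X_i)
    reach = A.copy()  # node pairs joined by a directed path of the current length
    n_paths = 0
    score = 0.0
    for _ in range(d - 1):
        mask = reach > 0
        n_paths += mask.sum()
        score += (mask & (ratio > 1 + tol)).sum()
        score += 0.5 * (mask & (np.abs(ratio - 1) <= tol)).sum()
        reach = (reach @ A > 0).astype(int)  # extend every path by one edge
    return float(score / n_paths)


rng = np.random.default_rng(2)
n = 2000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])

print(varsortability(X, A))                  # 1.0: variance grows along the chain
print(varsortability(X / X.std(axis=0), A))  # 0.5: standardising ties all variances
```

A value near 1 means the causal order can be read off from sorted variances, which is why low varsortability is a desirable property for benchmark data.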
