Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders

Gongxu Luo, Haoyue Dai, Loka Li, Chengqian Gao, Boyang Sun, Kun Zhang

TL;DR

Gene regulatory network inference is confounded by latent factors and pervasive selection bias, which can masquerade as regulatory dependencies. The authors introduce GISL, a nonparametric causal discovery algorithm that integrates observational and perturbation data to differentiate regulatory edges, latent confounders, and selection processes by leveraging perturbation symmetry and CI patterns within an augmented DAG. They prove identifiability of causal relations, selection, and latent confounders up to CI-pattern equivalence under standard assumptions, and validate GISL on synthetic data and real single-cell perturbation datasets, demonstrating improved precision and robustness over baseline methods. The work provides a principled framework for disentangling complex upstream processes in GRNI, enabling more reliable causal inferences and revealing non-regulatory mechanisms of selection and confounding that shape observed gene expression. Overall, GISL offers a practical path toward accurate interventional causal discovery in genomics, with potential impacts on targeted interventions and understanding cellular regulation under bias.

Abstract

Gene regulatory network inference (GRNI) aims to discover how genes causally regulate each other from gene expression data. It is well-known that statistical dependencies in observed data do not necessarily imply causation, as spurious dependencies may arise from latent confounders, such as non-coding RNAs. Numerous GRNI methods have thus been proposed to address this confounding issue. However, dependencies may also result from selection--only cells satisfying certain survival or inclusion criteria are observed--while these selection-induced spurious dependencies are frequently overlooked in gene expression data analyses. In this work, we show that such selection is ubiquitous and, when ignored or conflated with true regulations, can lead to flawed causal interpretation and misguided intervention recommendations. To address this challenge, a fundamental question arises: can we distinguish dependencies due to regulation, confounding, and crucially, selection? We show that gene perturbations offer a simple yet effective answer: selection-induced dependencies are symmetric under perturbation, while those from regulation or confounding are not. Building on this motivation, we propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover both true gene regulatory relations and non-regulatory mechanisms of selection and confounding up to the equivalence class. Experiments on synthetic and real-world gene expression data demonstrate the effectiveness of our method.
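
The perturbation-symmetry claim in the abstract can be illustrated with a small simulation. The sketch below is not GISL. It is a toy linear-Gaussian model in which every coefficient, the mean-shift form of the perturbation, the survival threshold, and the function names `simulate` and `marginal_effects` are assumptions made for illustration. It also assumes selection acts after the perturbation. For each of three structures it measures how much the mean of the non-targeted gene moves when the other gene is perturbed.

```python
import numpy as np

def simulate(structure, perturb=None, n=200_000, seed=0):
    """Draw (X, Y) for the observed cells under a hypothetical toy model.

    perturb: None, "X", or "Y"; a perturbation is modelled as a mean shift
    of the targeted gene (an assumption, not the paper's perturbation model).
    """
    rng = np.random.default_rng(seed)
    shift_x = 2.0 if perturb == "X" else 0.0
    shift_y = 2.0 if perturb == "Y" else 0.0
    keep = np.ones(n, dtype=bool)  # by default every cell is observed
    if structure == "causal":            # X -> Y
        x = rng.normal(size=n) + shift_x
        y = 0.8 * x + rng.normal(size=n) + shift_y
    elif structure == "confounded":      # X <- L -> Y, L unobserved
        latent = rng.normal(size=n)
        x = latent + rng.normal(size=n) + shift_x
        y = latent + rng.normal(size=n) + shift_y
    elif structure == "selection":       # X -> S <- Y, only S = 1 observed
        x = rng.normal(size=n) + shift_x
        y = rng.normal(size=n) + shift_y
        keep = (x + y + rng.normal(size=n)) > 1.0
    else:
        raise ValueError(f"unknown structure: {structure}")
    return x[keep], y[keep]

def marginal_effects(structure):
    """Return (|shift in mean Y under I_X|, |shift in mean X under I_Y|)."""
    x_obs, y_obs = simulate(structure, None, seed=1)
    _, y_under_ix = simulate(structure, "X", seed=2)
    x_under_iy, _ = simulate(structure, "Y", seed=3)
    return (abs(y_under_ix.mean() - y_obs.mean()),
            abs(x_under_iy.mean() - x_obs.mean()))

if __name__ == "__main__":
    for name in ("causal", "confounded", "selection"):
        on_y, on_x = marginal_effects(name)
        print(f"{name:>10}: effect on Y of I_X = {on_y:.3f}, "
              f"effect on X of I_Y = {on_x:.3f}")
```

In this toy setting the causal structure responds in one direction only, the selection structure responds in both directions, and the confounded structure shows no marginal response in either direction. Marginal mean shifts are only a crude proxy. The conditional tests summarized in the Figure 3 caption (for example, conditioning on the perturbed gene) are what separate the remaining cases.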

Paper Structure (38 sections, 1 theorem, 4 equations, 14 figures, 7 tables, 1 algorithm)

Key Result

Theorem 3.7

(Identifiability of GISL) Let the observational and perturbation data be generated from the DAG model $\mathcal{G}$ defined in Equation (1). Under the Markov and faithfulness assumptions, as the sample size $n\rightarrow \infty$, the causal relationships, selection processes, and latent c...

Figures (14)

  • Figure 1: Alternative graphical representations of interventions. (a) Mutilated DAGs depicting hard interventions [hauser2012characterization]. (b) Generalized intervention representation using the augmented DAG [yang2018characterizing]. (c) Augmented DAG for confounded pairs, where $L$ denotes a latent confounder [magliacane2016ancestral].
  • Figure 2: (a) Scatterplot of $X$ and $Y$ showing selected patients ('$\bullet$') and excluded individuals ('$\times$'). (b) and (c) Distributions after two distinct interventions on variables $X$ and $Y$, respectively, in the selected population ('$\bullet$' in (a)).
  • Figure 3: Distinguishing causal, selection, and confounding structures via perturbation effect and symmetry. Each panel marks the gene pair targeted by the CI tests and reports the CI results. (a) Direct causal structure between X and Y ('C'). (b) Latent confounder between X and Y ('L'). (c) Selection process ('S'). (d) Causation together with a latent confounder ('C & L'). (e) Causation together with a selection process ('C & S'). (f) Reference table summarizing the CI patterns for each target gene pair: different symbols correspond to different CI relations; black symbols ($\blacktriangle$, $\blacktriangledown$, $\blacklozenge$, $\bullet$) indicate conditional independence, while white symbols ($\triangle$, $\triangledown$, $\lozenge$, $\circ$) indicate conditional dependence. For example, (a) encodes four CI relations: $Y \not\!\perp\!\!\!\perp I_X \mid S$ ($\triangle$) and $Y \perp \!\!\! \perp I_X \mid X, S$ ($\blacktriangledown$) at the top; $X \perp \!\!\! \perp I_Y \mid S$ ($\blacklozenge$) and $X \not\!\perp\!\!\!\perp I_Y \mid Y, S$ ($\circ$) at the bottom.
  • Figure 4: Comparison of methods in identifying regulatory relations on four metrics: DAG $F_1$, DAG Precision, DAG Recall, and DAG SHD (Structural Hamming Distance). All values are averaged over 10 runs with different random seeds. Error bars represent the 95% confidence interval.
  • Figure 5: Experimental results on 19 perturbed key genes from Perturb-seq [Dixit_Parnas_Li_Chen_Fulco_JerbyArnon_Marjanovic_Dionne_Burks_Raychowdhury_etal._2016]. S and L mark the detected selection and confounded pairs; blue edges are previously known regulatory interactions [kuleshov2016enrichr].
  • ...and 9 more figures
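
The four metrics named in the Figure 4 caption can be computed from a pair of adjacency matrices. The sketch below uses common definitions, not necessarily the paper's exact evaluation code: it assumes `adj[i][j] = 1` means an edge from gene i to gene j, scores edges by exact direction, and counts a reversed edge once in the SHD (conventions differ on that point). The function name `dag_metrics` is hypothetical.

```python
import numpy as np

def dag_metrics(true_adj, est_adj):
    """Directed-edge precision, recall, F1, and SHD for two DAG adjacency matrices."""
    t = np.asarray(true_adj, dtype=bool)
    e = np.asarray(est_adj, dtype=bool)

    tp = int(np.sum(t & e))    # edges present in both, same direction
    fp = int(np.sum(~t & e))   # estimated edges absent from the truth
    fn = int(np.sum(t & ~e))   # true edges missing from the estimate

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # SHD: each missing or extra edge costs 1; a reversed edge produces two
    # mismatched entries, so subtract one per reversal to count it once.
    mismatches = int(np.sum(t != e))
    reversals = int(np.sum(t & ~e & e.T & ~t.T))
    shd = mismatches - reversals

    return {"precision": precision, "recall": recall, "f1": f1, "shd": shd}

if __name__ == "__main__":
    # Truth: 0 -> 1, 1 -> 2. Estimate: 1 -> 0 (reversed), 1 -> 2, 0 -> 2 (extra).
    truth = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
    estimate = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]
    print(dag_metrics(truth, estimate))
```

On the toy graphs in the `__main__` block, one of three estimated edges is correct and one of two true edges is recovered, and the SHD is 2 (one reversal plus one extra edge).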

Theorems & Definitions (14)

  • Definition 2.1: Confounded pair
  • Definition 2.2: Selection pair
  • Remark 3.3
  • Remark 3.4
  • Definition 3.5: d-separation [pearl1988probabilistic]
  • Definition 3.6: Inducing path [zhang2008completeness]
  • Theorem 3.7
  • Remark 3.8
  • Definition A.1: Marginal independence test
  • Definition A.2: Conditional independence test
  • ...and 4 more