Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders

Gongxu Luo; Haoyue Dai; Loka Li; Chengqian Gao; Boyang Sun; Kun Zhang

TL;DR

Gene regulatory network inference (GRNI) is confounded by latent factors and by pervasive selection bias, both of which can masquerade as regulatory dependencies. The authors introduce GISL, a nonparametric causal discovery algorithm that combines observational and perturbation data to distinguish regulatory edges, latent confounders, and selection processes, using perturbation symmetry and conditional independence (CI) patterns in an augmented DAG. They prove that causal relations, selection, and latent confounders are identifiable up to CI-pattern equivalence under the Markov and faithfulness assumptions, and they evaluate GISL on synthetic data and real single-cell perturbation datasets, reporting improved precision and robustness over baseline methods. The framework separates regulation from the non-regulatory mechanisms of selection and confounding that shape observed gene expression, which supports more reliable causal conclusions and intervention recommendations in genomics.

Abstract

Gene regulatory network inference (GRNI) aims to discover how genes causally regulate each other from gene expression data. It is well known that statistical dependencies in observed data do not necessarily imply causation, as spurious dependencies may arise from latent confounders, such as non-coding RNAs. Numerous GRNI methods have thus been proposed to address this confounding issue. However, dependencies may also result from selection, in which only cells satisfying certain survival or inclusion criteria are observed, and these selection-induced spurious dependencies are frequently overlooked in gene expression data analyses. In this work, we show that such selection is ubiquitous and, when ignored or conflated with true regulation, can lead to flawed causal interpretation and misguided intervention recommendations. To address this challenge, a fundamental question arises: can we distinguish dependencies due to regulation, confounding, and, crucially, selection? We show that gene perturbations offer a simple yet effective answer: selection-induced dependencies are symmetric under perturbation, while those from regulation or confounding are not. Building on this motivation, we propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover both true gene regulatory relations and non-regulatory mechanisms of selection and confounding up to the equivalence class. Experiments on synthetic and real-world gene expression data demonstrate the effectiveness of our method.
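
The symmetry argument in the abstract can be illustrated with a small simulation. The sketch below is not GISL and not the authors' code: it assumes Gaussian expression values, a shift-type perturbation of size 2, a linear effect of 0.8 for the causal case, and a hypothetical selection rule that keeps only cells with X + Y > 1. The perturbation is applied before selection, which is also an assumption of this toy setup. It compares how much perturbing one gene moves the mean of the other gene among observed cells.

```python
import numpy as np

def simulate(structure, perturb=None, n=200_000, shift=2.0, seed=0):
    """Draw (x, y) for the observed cells of a toy two-gene system.

    structure: "causal" (X -> Y) or "selection" (X -> S <- Y, only S = 1 observed).
    perturb:   None, "X", or "Y"; a shift applied to that gene before selection.
    All numeric choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    if perturb == "X":
        x = x + shift
    noise = rng.normal(size=n)
    y = 0.8 * x + noise if structure == "causal" else noise
    if perturb == "Y":
        y = y + shift
    # Selection keeps only cells passing a (hypothetical) survival criterion.
    keep = (x + y > 1.0) if structure == "selection" else np.ones(n, dtype=bool)
    return x[keep], y[keep]

def cross_effects(structure):
    """Return (shift in mean Y when perturbing X, shift in mean X when perturbing Y)."""
    x0, y0 = simulate(structure)
    _, y_ix = simulate(structure, perturb="X")
    x_iy, _ = simulate(structure, perturb="Y")
    return y_ix.mean() - y0.mean(), x_iy.mean() - x0.mean()

causal_effects = cross_effects("causal")
selection_effects = cross_effects("selection")
print("causal   :", causal_effects)
print("selection:", selection_effects)
```

In this toy setup the causal case is asymmetric (perturbing X moves Y, perturbing Y leaves X unchanged), while the selection case is symmetric (each perturbation moves the other gene by a similar amount among the surviving cells).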

Paper Structure (38 sections, 1 theorem, 4 equations, 14 figures, 7 tables, 1 algorithm)

Introduction
Preliminaries
Causal formulation of gene regulatory networks and gene perturbations
Understanding selection bias: principles and key characteristics
Methodology
Differentiating causal relations, selection processes, and latent confounders
Algorithm GISL: Handling both selection bias and latent confounding
Identifiability result of the GISL algorithm
Experiments
Identify the selection bias on synthetic data
Nonparametric settings.
Identify regulatory relationships on synthetic data
The presence of selection bias on single-cell gene expression data
Conclusion and Discussion
Concepts
...and 23 more sections

Key Result

Theorem 3.7

(Identifiability of GISL) Let the observational and perturbation data be generated from the DAG model $\mathcal{G}$ defined in Equation (1). Under the Markov and faithfulness assumptions, as the sample size $n \rightarrow \infty$, the causal relationships, selection processes, and latent c...

Figures (14)

Figure 1: Alternative graphical representations of interventions. (a) Mutilated DAGs depicting hard interventions [hauser2012characterization]. (b) Generalized intervention representation using the augmented DAG [yang2018characterizing]. (c) Augmented DAG for confounded pairs, where $L$ denotes a latent confounder [magliacane2016ancestral].
Figure 2: (a) Scatterplot of $X$ and $Y$ showing selected patients ('$\bullet$') and excluded individuals ('$\times$'). (b) and (c) Distributions after two distinct interventions on variables $X$ and $Y$, respectively, in the selected population ('$\bullet$' in (a)).
Figure 3: Distinguishing causal, selection, and confounding structures via perturbation effect and symmetry. […] indicates the gene pair targeted by each CI test. (a) Direct causal relation between $X$ and $Y$ ('C'). (b) Latent confounder between them ('L'). (c) Selection process ('S'). (d) Causation together with a latent confounder ('C & L'). (e) Causation together with a selection process ('C & S'). […] contains the CI results. (f) Reference table summarizing the CI patterns for each target gene pair: different symbols correspond to different CI relations; black symbols ($\blacktriangle$, $\blacktriangledown$, $\blacklozenge$, $\bullet$) indicate conditional independence, while white symbols ($\triangle$, $\triangledown$, $\lozenge$, $\circ$) indicate conditional dependence. For example, (a) encodes four CI relations: $Y \not\!\perp\!\!\!\perp I_X \mid S$ ($\triangle$) and $Y \perp \!\!\! \perp I_X \mid X, S$ ($\blacktriangledown$) at the top; $X \perp \!\!\! \perp I_Y \mid S$ ($\blacklozenge$) and $X \not\!\perp\!\!\!\perp I_Y \mid Y, S$ ($\circ$) at the bottom.
Figure 4: Comparison of methods for identifying regulatory relations under four metrics: DAG $F_1$, DAG Precision, DAG Recall, and DAG SHD (Structural Hamming Distance). All values are averaged over 10 runs with different random seeds. Error bars represent the 95% confidence interval.
Figure 5: Experimental results on 19 perturbed key genes from Perturb-seq [Dixit_Parnas_Li_Chen_Fulco_JerbyArnon_Marjanovic_Dionne_Burks_Raychowdhury_etal._2016]. 'S' and 'L' mark the detected selection and confounded pairs; blue edges are previously known regulatory interactions [kuleshov2016enrichr].
...and 9 more figures
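
The four metrics named in the Figure 4 caption can be computed from adjacency matrices. The following is a minimal sketch, not the paper's evaluation code: the function name `dag_metrics`, the example graphs, and the convention that a reversed edge counts once toward SHD are assumptions made here (some implementations count a reversal as two errors).

```python
import numpy as np

def dag_metrics(true_adj, est_adj):
    """Precision, recall, F1 over directed edges, plus SHD.

    adj[i, j] = 1 means an edge i -> j. SHD counts node pairs whose edge
    status differs (missing, extra, or reversed), each pair counted once.
    """
    t = np.asarray(true_adj, dtype=bool)
    e = np.asarray(est_adj, dtype=bool)
    tp = int(np.sum(t & e))
    precision = tp / e.sum() if e.sum() else 0.0
    recall = tp / t.sum() if t.sum() else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    diff = t != e
    # A pair {i, j} is mismatched if either direction disagrees.
    shd = int(np.triu(diff | diff.T, k=1).sum())
    return {"precision": precision, "recall": recall, "f1": f1, "shd": shd}

def adjacency(n_nodes, edges):
    """Build an n x n adjacency matrix from a list of (parent, child) pairs."""
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        adj[i, j] = 1
    return adj

# Hypothetical 4-gene example: one correct edge, one reversed, one missing, one extra.
true_graph = adjacency(4, [(0, 1), (1, 2), (2, 3)])
est_graph = adjacency(4, [(0, 1), (2, 1), (0, 3)])
metrics = dag_metrics(true_graph, est_graph)
print(metrics)
```

Here the estimate shares one directed edge with the truth, so precision, recall, and F1 are all 1/3, and the three mismatched pairs give an SHD of 3 under the stated convention.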

Theorems & Definitions (14)

Definition 2.1: Confounded pair
Definition 2.2: Selection pair
Remark 3.3
Remark 3.4
Definition 3.5: d-separation [pearl1988probabilistic]
Definition 3.6: Inducing path [zhang2008completeness]
Theorem 3.7
Remark 3.8
Definition A.1: Marginal independence test
Definition A.2: Conditional independence test
...and 4 more
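
To make the d-separation definition and the CI patterns of Figure 3 concrete, the sketch below checks d-separation on small augmented DAGs and lists, for each structure, which of the four CI relations from the Figure 3 caption hold. It is an illustration only: the five edge lists are this sketch's reading of the labels 'C', 'L', 'S', 'C & L', and 'C & S', the node names `I_X`, `I_Y`, `L`, and `S` follow the captions, and the d-separation check uses the standard moralized-ancestral-graph criterion rather than anything specific to GISL.

```python
from itertools import combinations

def d_separated(edges, x, y, given):
    """True if x and y are d-separated given `given` in the DAG `edges`.

    Uses the moral graph of the ancestral set of {x, y} and `given`.
    """
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    # Ancestral set of all nodes involved.
    keep, stack = set(), [x, y, *given]
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, ()))
    # Moralize: link each node to its parents and marry co-parents.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # x and y are separated if no path avoids the conditioning set.
    seen, stack = {x}, [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        for w in adj[v] - seen - set(given):
            seen.add(w)
            stack.append(w)
    return True

targets = [("I_X", "X"), ("I_Y", "Y")]  # perturbation indicators
structures = {  # assumed toy graphs for the five labels
    "C": [("X", "Y")],
    "L": [("L", "X"), ("L", "Y")],
    "S": [("X", "S"), ("Y", "S")],
    "C & L": [("X", "Y"), ("L", "X"), ("L", "Y")],
    "C & S": [("X", "Y"), ("X", "S"), ("Y", "S")],
}

def ci_pattern(edges):
    """Independence (True) or dependence (False) for the four CI relations."""
    g = edges + targets
    return (
        d_separated(g, "Y", "I_X", ["S"]),
        d_separated(g, "Y", "I_X", ["X", "S"]),
        d_separated(g, "X", "I_Y", ["S"]),
        d_separated(g, "X", "I_Y", ["Y", "S"]),
    )

patterns = {name: ci_pattern(edges) for name, edges in structures.items()}
for name, pattern in patterns.items():
    print(name, pattern)
```

For 'C' this reproduces the four relations spelled out in the Figure 3 caption, and under these assumed graphs the five structures yield five distinct patterns, with 'S' showing the same pattern in both perturbation directions.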
