Gene Regulatory Network Inference in the Presence of Dropouts: a Causal View

Haoyue Dai, Ignavier Ng, Gongxu Luo, Peter Spirtes, Petar Stojanov, Kun Zhang

TL;DR

The paper introduces the Causal Dropout Model, a causal graphical model that characterizes the dropout mechanism. It shows that the conditional independence (CI) relations in data with dropouts, after deleting the samples with zero values for the conditioned variables, are asymptotically identical to the CI relations in the original data without dropouts.

Abstract

Gene regulatory network inference (GRNI) is a challenging problem, particularly owing to the presence of zeros in single-cell RNA sequencing data: some are biological zeros representing no gene expression, while others are technical zeros arising from the sequencing procedure (aka dropouts), which may bias GRNI by distorting the joint distribution of the measured gene expressions. Existing approaches typically handle dropout error via imputation, which may introduce spurious relations because the true joint distribution is generally unidentifiable. To tackle this issue, we introduce a causal graphical model to characterize the dropout mechanism, namely, the Causal Dropout Model. We provide a simple yet effective theoretical result: the conditional independence (CI) relations in the data with dropouts, after deleting the samples with zero values (regardless of whether they are technical or not) for the conditioned variables, are asymptotically identical to the CI relations in the original data without dropouts. This test-wise deletion procedure, in which we perform CI tests on the samples without zeros for the conditioned variables, can be seamlessly integrated with existing structure learning approaches, including constraint-based and greedy score-based methods, thus giving rise to a principled framework for GRNI in the presence of dropouts. We further show that the causal dropout model can be validated from data, and that many existing statistical models for handling dropouts fit into our model as specific parametric instances. Empirical evaluation on synthetic, curated, and real-world experimental transcriptomic data demonstrates the efficacy of our method.
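
The test-wise deletion idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Fisher-z partial-correlation test is just one possible CI test, the chain $Z_1 \to Z_2 \to Z_3$ with Gaussian noise is a toy model, dropouts are applied at a fixed rate of 0.3, and the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def partial_corr(data, i, j, cond):
    """Sample partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)  # precision matrix of the selected columns
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])


def fisher_z_pvalue(data, i, j, cond):
    """Two-sided p-value of a Fisher-z test (an assumed choice of CI test)."""
    n = data.shape[0]
    r = np.clip(partial_corr(data, i, j, cond), -0.999999, 0.999999)
    stat = np.sqrt(n - len(cond) - 3) * abs(np.arctanh(r))
    return 2 * stats.norm.sf(stat)


def delete_zero_conditioning_rows(data, cond):
    """Keep only the samples whose conditioned variables are all nonzero."""
    keep = np.all(data[:, list(cond)] != 0, axis=1)
    return data[keep]


def simulate_chain(n, rate, seed=0):
    """Toy data: chain Z1 -> Z2 -> Z3, then fixed-rate dropout (assumption)."""
    rng = np.random.default_rng(seed)
    z1 = rng.normal(size=n)
    z2 = z1 + 0.5 * rng.normal(size=n)
    z3 = z2 + 0.5 * rng.normal(size=n)
    z = np.column_stack([z1, z2, z3])
    dropped = rng.random(z.shape) < rate  # independent dropout indicators
    return np.where(dropped, 0.0, z)


# In the dropout-free data, Z1 and Z3 are independent given Z2.
x = simulate_chain(n=20000, rate=0.3)

# Naive test on all samples: zeros in X2 break the conditioning.
print("naive p-value:", fisher_z_pvalue(x, 0, 2, [1]))

# Test-wise deletion: drop samples where the conditioned variable X2 is zero.
kept = delete_zero_conditioning_rows(x, [1])
print("partial correlation after deletion:", partial_corr(kept, 0, 2, [1]))
print("samples kept:", len(kept), "of", len(x))
```

Only the samples with zeros in the conditioned variables are removed, and the deletion is redone for each test, so zeros in the two tested variables themselves are left in place. In a structure learning algorithm such as PC, a wrapper of this kind would stand in for the plain CI test.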

Paper Structure (40 sections, 11 theorems, 5 equations, 14 figures, 2 tables)

Key Result

Proposition 1

Assume A1, A2. $\forall i\in[p],\ j\in[p],\ \mathbf{S}\subset[p]$, we have $Z_i \not\perp\!\!\!\perp Z_j \mid \mathbf{Z}_\mathbf{S} \Rightarrow X_i \not\perp\!\!\!\perp X_j \mid \mathbf{X}_\mathbf{S}$. The reverse direction does not hold in general, and holds on

Figures (14)

  • Figure 1: Causal graph for dropouts. Gray nodes are underlying partially observed variables and white nodes are observed ones.
  • Figure 2: Above: scatterplots of $Z_1;Z_3$ and $X_1;X_3$ under different conditions. Below: density plot of $X_3$ under different conditions (vertical slices of scatters in (f)) to show $X_3 \perp\!\!\!\perp X_1 \mid X_2=1$. Kernel width is set to $0.02$ to prevent oversmoothing.
  • Figure 3: The remaining sample size after deleting samples with zeros in the conditioned variables, for each CI test during a run of PC on the real dataset dixit2016perturb, with $15$ genes and $9843$ cells.
  • Figure 4: Experimental results (SHDs of CPDAGs) of $30$ variables on simulated data, where three dropout mechanisms are considered. The variables $\mathbf{Z}$ follow a Gaussian or lognormal distribution.
  • Figure 5: Experimental results (F1-scores of estimated skeleton edges) on BoolODE simulated data and BEELINE benchmark framework pratapa2020benchmarking. The 9 rows correspond to PC, GES, and 7 other GRNI-specific SOTA algorithms benchmarked. The 10 colored column blocks correspond to all 6 synthetic and 4 curated datasets in pratapa2020benchmarking. The 5 column strips in each block correspond to different dropout-handling strategies. Cell colors indicate the corresponding values (brighter is higher, i.e., better). The maximum and minimum of each strip are annotated.
  • ...and 9 more figures

Theorems & Definitions (23)

  • Example 1: Dropout with the fixed rates
  • Example 2: Truncating low expressions to zero
  • Example 3: Dropout probabilistically determined by expressions
  • Example 4
  • Example 5
  • Example 6
  • Proposition 1: Bias to a denser graph
  • Theorem 1: Correct CI estimations
  • Definition 1: The general procedure for causal discovery with dropout correction
  • Definition 2: Generalized GRN and dropout mechanisms discovery
  • ...and 13 more
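
The three dropout mechanisms named in Examples 1 to 3 can be sketched as follows. This page gives only the example titles, so the parametric forms below are assumptions chosen for illustration (in particular the logistic dropout probability and all parameter values), and the function names are hypothetical.

```python
import numpy as np


def dropout_fixed_rate(z, rate, rng):
    """Example 1 (sketch): zero each entry independently with probability `rate`."""
    return np.where(rng.random(z.shape) < rate, 0.0, z)


def dropout_truncation(z, threshold):
    """Example 2 (sketch): set expressions below `threshold` to zero."""
    return np.where(z < threshold, 0.0, z)


def dropout_expression_dependent(z, slope, rng):
    """Example 3 (sketch): dropout probability decreases with expression.

    The logistic form 1 / (1 + exp(slope * z)) is an assumption.
    """
    p_drop = 1.0 / (1.0 + np.exp(slope * z))
    return np.where(rng.random(z.shape) < p_drop, 0.0, z)


rng = np.random.default_rng(0)
z = rng.lognormal(mean=0.0, sigma=1.0, size=10000)  # toy true expressions

x_fixed = dropout_fixed_rate(z, rate=0.3, rng=rng)
x_trunc = dropout_truncation(z, threshold=0.5)
x_prob = dropout_expression_dependent(z, slope=1.0, rng=rng)

for name, x in [("fixed", x_fixed), ("truncation", x_trunc), ("probabilistic", x_prob)]:
    print(name, "zero fraction:", np.mean(x == 0))
```

In all three sketches an observed nonzero value equals the underlying expression. That is why keeping only the nonzero samples of a conditioned variable lets it stand in for the true expression.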