dotears: Scalable, consistent DAG estimation using observational and interventional data

Albert Xue, Jingyou Rao, Sriram Sankararaman, Harold Pimentel

TL;DR

dotears addresses the identifiability and scalability challenges of learning causal DAGs from observational data by integrating interventional data to estimate exogenous variance, enabling a consistent joint structure estimator under a linear SEM. The method extends NO TEARS with a marginal exogenous-variance estimator $\hat{\Omega}_0$ and an interventional loss that jointly optimizes a single weighted adjacency matrix $W$ subject to a continuous DAG constraint $h(W) = 0$. Theoretical results show that dotears is consistent as $\hat{\Omega}_0$ converges to the true variance $\Omega_0$, and simulations demonstrate robust performance across toy, large-scale, and genome-wide Perturb-seq settings, often outperforming state-of-the-art methods. In both synthetic and real Perturb-seq data, dotears yields edges with higher precision and recall, validated via differential expression tests and high-confidence protein interactions, illustrating its practical impact for inferring gene regulatory networks from perturbational data. Overall, dotears provides a principled, scalable framework for causal DAG estimation that leverages interventional information to overcome variance-induced identifiability issues.
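As a rough illustration of the continuous DAG constraint, the sketch below evaluates $h(W) = \mathrm{tr}\left(e^{W \circ W}\right) - p$, the acyclicity function introduced by NO TEARS. It assumes that dotears uses this same form, which the summary above does not state explicitly. The function name `acyclicity`, the convention that `W[i][j]` is the weight of edge $i \to j$, and the truncated Taylor series used for the matrix exponential are illustrative choices, not the authors' implementation.

```python
import math

def acyclicity(W, terms=40):
    """Return h(W) = tr(exp(W o W)) - p using a truncated Taylor series.

    W is a p x p list of lists; W[i][j] is the weight of edge i -> j
    (a convention assumed here for illustration).
    """
    p = len(W)
    # Hadamard (elementwise) square keeps every entry non-negative.
    A = [[w * w for w in row] for row in W]
    # term holds A^k / k!, starting from the identity (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    trace = float(p)
    for k in range(1, terms + 1):
        term = [
            [sum(term[i][m] * A[m][j] for m in range(p)) / k for j in range(p)]
            for i in range(p)
        ]
        trace += sum(term[i][i] for i in range(p))
    return trace - p

dag = [[0.0, 0.8], [0.0, 0.0]]     # X1 -> X2 only: acyclic
cyclic = [[0.0, 0.8], [0.5, 0.0]]  # X1 -> X2 and X2 -> X1: a 2-cycle

h_dag = acyclicity(dag)        # zero, since W o W is nilpotent for a DAG
h_cyclic = acyclicity(cyclic)  # strictly positive because of the cycle
```

Because $h$ is differentiable and vanishes exactly on acyclic $W$, it can serve as an equality constraint in a gradient-based optimizer in place of a combinatorial search over graphs.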

Abstract

New biological assays like Perturb-seq link highly parallel CRISPR interventions to a high-dimensional transcriptomic readout, providing insight into gene regulatory networks. Causal gene regulatory networks can be represented by directed acyclic graphs (DAGs), but learning DAGs from observational data is complicated by lack of identifiability and a combinatorial solution space. Score-based structure learning improves practical scalability of inferring DAGs. Previous score-based methods are sensitive to error variance structure; on the other hand, estimation of error variance is difficult without prior knowledge of structure. Accordingly, we present $\texttt{dotears}$ [doo-tairs], a continuous optimization framework which leverages observational and interventional data to infer a single causal structure, assuming a linear Structural Equation Model (SEM). $\texttt{dotears}$ exploits structural consequences of hard interventions to give a marginal estimate of exogenous error structure, bypassing the circular estimation problem. We show that $\texttt{dotears}$ is a provably consistent estimator of the true DAG under mild assumptions. $\texttt{dotears}$ outperforms other methods in varied simulations, and in real data infers edges that validate with higher precision and recall than state-of-the-art methods through differential expression tests and high-confidence protein-protein interactions.

Paper Structure (49 sections, 11 theorems, 136 equations, 24 figures, 7 tables)

This paper contains 49 sections, 11 theorems, 136 equations, 24 figures, 7 tables.

Key Result

Lemma 1

Let $\gamma \coloneqq \frac{\sigma_1^2}{\sigma_2^2}$. The system $X_1 \overset{w}{\rightarrow} X_2$ is varsortable if and only if $|w| \geq \sqrt{1 - \frac{1}{\gamma}}$.
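The bound in Lemma 1 can be checked with a few lines of arithmetic. The sketch below assumes the two-node linear SEM $X_2 = w X_1 + \epsilon_2$ with independent noise, and takes "varsortable" to mean that marginal variance does not decrease along the causal order, so that $\text{Var}(X_2) = w^2 \sigma_1^2 + \sigma_2^2 \geq \sigma_1^2 = \text{Var}(X_1)$, which rearranges to $w^2 \geq 1 - \frac{1}{\gamma}$. The function names and the normalization $\sigma_2^2 = 1$ are illustrative assumptions.

```python
import math

def is_varsortable(w, gamma):
    """Check Var(X2) >= Var(X1) for X1 -w-> X2 with sigma_1^2 = gamma * sigma_2^2.

    sigma_2^2 is normalized to 1 (an assumption made for illustration;
    only the ratio gamma matters for the comparison).
    """
    var_x1 = gamma
    var_x2 = w * w * gamma + 1.0
    return var_x2 >= var_x1

def varsortability_bound(gamma):
    """Smallest |w| that satisfies the bound; 0 when gamma <= 1."""
    return math.sqrt(max(0.0, 1.0 - 1.0 / gamma))

bound_2 = varsortability_bound(2.0)      # sqrt(1/2)
bound_100 = varsortability_bound(100.0)  # sqrt(0.99)

above = is_varsortable(0.8, 2.0)  # |w| above the bound for gamma = 2
below = is_varsortable(0.6, 2.0)  # |w| below the bound for gamma = 2
```

As $\gamma$ grows, the bound approaches 1, so a weak edge out of a high-variance parent is not varsortable.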

Figures (24)

  • Figure 1: Hard interventions allow marginal estimation of the error variances $\Omega_0 \coloneqq \text{diag}\left(\sigma_1^2, \sigma_2^2\right)$. (a) The true observational DAG $X_1 \overset{w}{\to} X_2$, with corresponding distributions. (b) Hard interventions remove incoming edges to the target. The marginal variance shrinks due to removal of upstream variance and effects from the intervention.
  • Figure 2: Comparison of $\ell_1$ distance (lower is better) between true structure and estimates from NO TEARS (black), GOLEM-NV (orange), NO TEARS interventional (green), and dotears (blue). Each method corrects differently for $\Omega_0$. For each $w = 0.1, 0.2, \dots, 1.0$ and $\gamma = 1, 2, 100$, we generate Gaussian data from the structure $X_1 \overset{w}{\to} X_2$ such that $\sigma_1^2 = \gamma \sigma_2^2$. For each pair $w, \gamma$, we draw 25 simulations at a sample size of $n = 3000$. The dashed grey line represents the varsortability bound $|w| \geq \sqrt{1 - \frac{1}{\gamma}}$. Bars represent standard errors; some standard errors are too small to see.
  • Figure 3: Method performance on large random graphs ($p = 40$) using Structural Hamming Distance (lower is better). Rows index Erdős-Rényi or Scale Free topologies. Columns index parameterizations of edge density and weight, ordered in increasing difficulty. For details, see the Supplementary Material section on large-$p$ data generation. 10 simulations were drawn for each parameterization with sample size $(p+1) \times 100 = 4100$. * indicates cross-validated methods. Methods are sorted by average performance.
  • Figure 4: a) dotears-inferred network. Edges with magnitude less than 0.2, and genes without inferred edges, were removed. b) Precision-recall curves across differential expression calls made by DESeq2. Dashed red lines indicate recall of dotears at thresholds of $|w| < 0.2, 0.1,$ and $0.05$ respectively. c) Precision-recall curves across high confidence protein-protein interactions nominated by STRING. Dashed red lines indicate recall of dotears at thresholds of $|w| < 0.2, 0.1,$ and $0.05$ respectively. d) dotears infers HSP90AB1 $\rightarrow$ HSP90AA1. HSP90AB1 knockdown increases expression of HSP90AA1, but HSP90AA1 knockdown does not change HSP90AB1 expression. e) dotears-inferred edges show correlated gene expression in hold-out observational data.
  • Figure 5: Full simulations for the two-node DAG for all $(w, \gamma) \in \{0.1, 0.2, \dots ,1.5\} \times \{1, 2, 4, 10, 100\}$, where $X_1 \overset{w}{\rightarrow} X_2$ and $\sigma_1^2 = \gamma\sigma_2^2$. Methods are compared using SHD (lower is better).
  • ...and 19 more figures
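
The idea in Figure 1, that a hard intervention exposes the target's own error variance, can be sketched with a small simulation of $X_1 \overset{w}{\to} X_2$. This is a simplified illustration rather than the dotears estimator: it assumes the intervention on $X_2$ only deletes the incoming edge and leaves the noise variance $\sigma_2^2$ unchanged, so it ignores any additional effect of the intervention itself on the variance. The function `simulate`, its parameters, and the chosen values of $w$, $\sigma_1^2$, and $\sigma_2^2$ are hypothetical.

```python
import random
import statistics

def simulate(n, w, var1, var2, target=None, seed=0):
    """Draw n Gaussian samples from the linear SEM X1 -w-> X2.

    target=None gives observational data; target=2 applies a hard
    intervention on X2, modeled here (as an assumption) by deleting
    the edge X1 -> X2 while keeping the noise variance var2.
    """
    rng = random.Random(seed)
    sd1, sd2 = var1 ** 0.5, var2 ** 0.5
    x1s, x2s = [], []
    for _ in range(n):
        x1 = rng.gauss(0.0, sd1)
        edge = 0.0 if target == 2 else w  # incoming edge removed under do(X2)
        x2 = edge * x1 + rng.gauss(0.0, sd2)
        x1s.append(x1)
        x2s.append(x2)
    return x1s, x2s

w, var1, var2 = 0.8, 4.0, 1.0
_, x2_obs = simulate(20000, w, var1, var2, target=None, seed=1)
_, x2_int = simulate(20000, w, var1, var2, target=2, seed=2)

# Observational variance mixes upstream and own noise: w^2 * var1 + var2.
obs_var_x2 = statistics.variance(x2_obs)
# Under the intervention only the target's own noise remains.
est_var2 = statistics.variance(x2_int)
```

Under these assumptions the sample variance of the intervened target estimates $\sigma_2^2$ without any knowledge of $w$, which is the sense in which the estimate is marginal and sidesteps the circular dependence between structure and variance.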

Theorems & Definitions (21)

  • Lemma 1
  • Theorem 1
  • Corollary 1
  • proof
  • proof
  • Lemma 2
  • Corollary 2
  • Remark 1
  • Remark 2
  • Theorem 2
  • ...and 11 more