dotears: Scalable, consistent DAG estimation using observational and interventional data
Albert Xue, Jingyou Rao, Sriram Sankararaman, Harold Pimentel
TL;DR
dotears addresses the identifiability and scalability challenges of learning causal DAGs from observational data by using interventional data to estimate exogenous variance, which enables a consistent joint structure estimator under a linear SEM. The method extends NOTEARS with a marginal exogenous-variance estimator 𝛀̂₀ and an interventional loss, jointly optimizing a single weighted adjacency matrix W subject to the continuous DAG constraint h(W)=0. Theoretical results show that dotears is consistent as 𝛀̂₀ converges to the true variance Ω₀. In simulations, dotears performs robustly across varied settings and often outperforms state-of-the-art methods. On real Perturb-seq data, it infers edges that validate with higher precision and recall, as measured by differential expression tests and high-confidence protein-protein interactions, illustrating its practical value for inferring gene regulatory networks from perturbational data. Overall, dotears provides a principled, scalable framework for causal DAG estimation that uses interventional information to overcome variance-induced identifiability issues.
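To make the constraint h(W)=0 concrete, the following minimal sketch evaluates the acyclicity function introduced by NOTEARS, h(W) = tr(exp(W ∘ W)) − d, where ∘ is the elementwise product and d is the number of nodes. It assumes dotears inherits this form unchanged; the function names `matmul` and `acyclicity` and the truncation length `terms` are illustrative choices, not part of the dotears software.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w, terms=30):
    """Approximate h(W) = tr(exp(W o W)) - d with a truncated power series.

    Hypothetical helper for illustration: h is zero when W has no directed
    cycles and positive otherwise.
    """
    d = len(w)
    a = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]  # Hadamard square
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # A^0
    trace_sum = float(d)  # tr(A^0) = d
    factorial = 1.0
    for k in range(1, terms + 1):
        power = matmul(power, a)  # A^k
        factorial *= k            # k!
        trace_sum += sum(power[i][i] for i in range(d)) / factorial
    return trace_sum - d


# A chain 1 -> 2 -> 3 is acyclic; a 2-cycle is not.
w_dag = [[0.0, 1.5, 0.0],
         [0.0, 0.0, -2.0],
         [0.0, 0.0, 0.0]]
w_cyc = [[0.0, 1.0],
         [1.0, 0.0]]
h_dag = acyclicity(w_dag)
h_cyc = acyclicity(w_cyc)
```

Because h is smooth in W, it can be enforced with standard constrained continuous optimizers rather than a combinatorial search over graphs.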
Abstract
New biological assays like Perturb-seq link highly parallel CRISPR interventions to a high-dimensional transcriptomic readout, providing insight into gene regulatory networks. Causal gene regulatory networks can be represented by directed acyclic graphs (DAGs), but learning DAGs from observational data is complicated by lack of identifiability and a combinatorial solution space. Score-based structure learning improves the practical scalability of inferring DAGs. However, previous score-based methods are sensitive to error variance structure, and estimating error variance is difficult without prior knowledge of the structure. Accordingly, we present $\texttt{dotears}$ [doo-tairs], a continuous optimization framework that leverages observational and interventional data to infer a single causal structure, assuming a linear Structural Equation Model (SEM). $\texttt{dotears}$ exploits structural consequences of hard interventions to give a marginal estimate of exogenous error structure, bypassing the circular estimation problem. We show that $\texttt{dotears}$ is a provably consistent estimator of the true DAG under mild assumptions. $\texttt{dotears}$ outperforms other methods in varied simulations, and in real data infers edges that validate with higher precision and recall than state-of-the-art methods through differential expression tests and high-confidence protein-protein interactions.
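The structural consequence of a hard intervention can be illustrated with a small simulation. In the two-node linear SEM below (x1 → x2), a hard intervention on x2 removes its incoming edge, so x2 reduces to its exogenous noise and its sample variance estimates the exogenous variance without knowing the graph. This is a sketch under stated assumptions: the intervention is assumed to leave the noise distribution of the target unchanged, and the edge weight, noise scales, sample size, and the `simulate` helper are all hypothetical values chosen for illustration.

```python
import random
import statistics


def simulate(n, weight, sd1, sd2, intervene_on_x2, rng):
    """Draw n samples from x1 = e1, x2 = weight * x1 + e2.

    Under a hard intervention on x2, the parent term is dropped,
    leaving x2 = e2 (assumed: noise scale unchanged by the intervention).
    """
    samples = []
    for _ in range(n):
        x1 = rng.gauss(0.0, sd1)
        parent_term = 0.0 if intervene_on_x2 else weight * x1
        x2 = parent_term + rng.gauss(0.0, sd2)
        samples.append((x1, x2))
    return samples


rng = random.Random(0)
obs = simulate(20000, 0.8, 1.0, 2.0, False, rng)  # observational regime
itv = simulate(20000, 0.8, 1.0, 2.0, True, rng)   # hard intervention on x2

# Observationally, Var(x2) = 0.8^2 * 1.0 + 4.0 = 4.64 mixes structure and noise.
var_obs = statistics.pvariance([x2 for _, x2 in obs])
# Under intervention, Var(x2) targets the exogenous variance 2.0^2 = 4.0 alone.
var_itv = statistics.pvariance([x2 for _, x2 in itv])
```

The observational variance of x2 depends on the unknown edge weight, whereas the interventional variance does not, which is why the interventional estimate sidesteps the circular dependence between structure and error variance.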
