Estimating large causal polytrees from small samples
Sourav Chatterjee, Mathukumalli Vidyasagar
TL;DR
This work addresses the problem of learning large causal polytrees from small samples in high-dimensional settings, using a fully nonparametric, two-stage approach. First, it recovers the skeleton using the pairwise $\xi$-coefficient and a maximal weighted spanning forest, with a high-probability guarantee that the estimated skeleton matches the true skeleton when $n$ is sufficiently large relative to $\log p$. Second, it recovers edge directions using a conditional dependence statistic $\tau_n$ together with $\xi$-based comparisons, with a high-probability correctness guarantee for the resulting DAG under mild assumptions. The method is implemented in an R package and demonstrated on simulations and a real mortgage-subsidy dataset, where it performs well even when $p$ is large relative to $n$. Overall, the paper provides a scalable, nonparametric framework for causal structure discovery in polytrees, with theoretical guarantees and potential applications to genomics and other high-dimensional domains.
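The skeleton stage described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation. It assumes continuous data with no ties (so the simple form of Chatterjee's $\xi_n$ applies), symmetrizes the edge weight by taking the larger of the two directional $\xi$ values, and uses a hypothetical `threshold` parameter to drop weak edges; all three choices, along with the function names, are assumptions made for illustration. The direction-recovery stage based on $\tau_n$ is not covered.

```python
import numpy as np


def xi_coefficient(x, y):
    """Chatterjee's xi_n(x, y), using the no-ties formula (assumption)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="stable")          # arrange pairs by increasing x
    ranks = np.argsort(np.argsort(y[order])) + 1  # rank of each y in that order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n ** 2 - 1)


def max_spanning_forest(weights, threshold=0.0):
    """Kruskal's algorithm for a maximum-weight spanning forest.

    Edges with weight <= threshold are skipped (hypothetical cutoff).
    """
    p = weights.shape[0]
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        ((weights[i, j], i, j) for i in range(p) for j in range(i + 1, p)),
        reverse=True,
    )
    forest = []
    for w, i, j in edges:
        if w <= threshold:
            break  # remaining edges are weaker still
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge does not create a cycle
            parent[ri] = rj
            forest.append((i, j))
    return forest


def estimate_skeleton(data, threshold=0.0):
    """Undirected skeleton estimate from an (n, p) data matrix."""
    p = data.shape[1]
    w = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            # Symmetrized weight: larger of the two directions (assumption).
            w[i, j] = w[j, i] = max(
                xi_coefficient(data[:, i], data[:, j]),
                xi_coefficient(data[:, j], data[:, i]),
            )
    return max_spanning_forest(w, threshold)
```

Calling `estimate_skeleton(data)` on an `(n, p)` array returns a list of undirected edges `(i, j)` with `i < j`. Computing all pairwise weights costs on the order of $p^2 n \log n$ operations, which is the dominant cost in this sketch.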
Abstract
We consider the problem of estimating a large causal polytree from a relatively small i.i.d. sample. This is motivated by the problem of determining causal structure when the number of variables is very large compared to the sample size, such as in gene regulatory networks. We give an algorithm that recovers the tree with high accuracy in such settings. The algorithm works under essentially no distributional or modeling assumptions other than some mild non-degeneracy conditions.
