Estimating large causal polytrees from small samples

Sourav Chatterjee, Mathukumalli Vidyasagar

TL;DR

This work tackles the problem of learning large causal polytrees from small samples in high-dimensional settings, using a fully nonparametric, two-stage approach. First, it recovers the skeleton using a pairwise $\xi$-coefficient and a maximal weighted spanning forest, with a high-probability guarantee that the estimated skeleton matches the true skeleton when $n$ is sufficiently large relative to $\log p$. Second, it recovers edge directions using a conditional dependence statistic $\tau_n$ and $\xi$-based comparisons, with provable high-probability correctness of the resulting DAG under mild assumptions. The method is implemented in an R package and demonstrated on simulations and on a real mortgage-subsidy dataset, where it performs robustly even when $p$ is large relative to $n$. Overall, the paper provides a scalable, nonparametric framework for causal structure discovery in polytrees, with strong theoretical guarantees and practical applicability to genomics and other high-dimensional domains.
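The skeleton stage described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R implementation: the function names `xi_coefficient`, `max_weight_spanning_forest`, and `estimate_skeleton` are hypothetical, as is the `threshold` parameter. The sketch also assumes continuous data without ties and symmetrizes the edge weight by taking the larger of the two directed $\xi$ values, which may differ from the paper's exact choice.

```python
import numpy as np


def xi_coefficient(x, y):
    """Chatterjee's xi rank coefficient of y on x (assumes no ties)."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    order = np.argsort(x, kind="stable")           # sort the pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)


def max_weight_spanning_forest(weights, threshold=0.0):
    """Kruskal's algorithm on a symmetric weight matrix, heaviest edges first.

    Edges with weight <= threshold are dropped (threshold is an
    illustrative assumption), so the result may be a forest, not a tree.
    """
    p = weights.shape[0]
    parent = list(range(p))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = [(weights[i, j], i, j)
             for i in range(p) for j in range(i + 1, p)
             if weights[i, j] > threshold]
    edges.sort(reverse=True)

    forest = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # adding the edge creates no cycle
            parent[ri] = rj
            forest.append((i, j))
    return forest


def estimate_skeleton(data, threshold=0.0):
    """Estimate an undirected skeleton from an (n, p) data matrix."""
    data = np.asarray(data, dtype=float)
    p = data.shape[1]
    weights = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            # Assumption: symmetrize by the larger directed coefficient.
            w = max(xi_coefficient(data[:, i], data[:, j]),
                    xi_coefficient(data[:, j], data[:, i]))
            weights[i, j] = weights[j, i] = w
    return max_weight_spanning_forest(weights, threshold)
```

Calling `estimate_skeleton(data)` on an `(n, p)` array returns a list of undirected edges `(i, j)` with `i < j`. The pairwise loop costs $O(p^2 n \log n)$, which is fine for a sketch but would likely need vectorizing for the large-$p$ regime the paper targets.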

Abstract

We consider the problem of estimating a large causal polytree from a relatively small i.i.d. sample. This is motivated by the problem of determining causal structure when the number of variables is very large compared to the sample size, such as in gene regulatory networks. We give an algorithm that recovers the tree with high accuracy in such settings. The algorithm works under essentially no distributional or modeling assumptions other than some mild non-degeneracy conditions.

Paper Structure

This paper contains 18 sections, 14 theorems, 63 equations, 1 figure, 2 tables.

Key Result

Theorem 4.1

Let $X = (X_i)_{i\in V}$ be a finite collection of random variables with a causal polytree skeleton $T$, as defined at the beginning of the introduction, and let $p:=|V|$. Let $T_n$ be the estimate of $T$ based on a sample of $n$ i.i.d. copies of $X$, as defined in the algorithm section. For each $i$ and $j$, [...] Furthermore, suppose that $n$ is so large that $|\mathbb{E}(\xi_{ij}^n) - \xi_{ij}| \le \delta^2/8$ [...]
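For orientation, the symbols $\xi_{ij}$ and $\xi_{ij}^n$ in the statement presumably refer to the population and sample versions of Chatterjee's $\xi$-coefficient for the pair $(X_i, X_j)$. The standard definitions are recalled below as an aid to reading. They assume continuous variables with no ties, and the paper's own definitions may differ in detail.

```latex
% Population coefficient; \mu denotes the law of Y.
\xi(X, Y) =
  \frac{\int \operatorname{Var}\bigl(\mathbb{E}(1_{\{Y \ge t\}} \mid X)\bigr)\, d\mu(t)}
       {\int \operatorname{Var}\bigl(1_{\{Y \ge t\}}\bigr)\, d\mu(t)}

% Sample coefficient: sort the pairs so that X_{(1)} < \dots < X_{(n)},
% and let r_k be the rank of Y_{(k)} among Y_1, \dots, Y_n.
\xi_n(X, Y) = 1 - \frac{3 \sum_{k=1}^{n-1} \lvert r_{k+1} - r_k \rvert}{n^2 - 1}
```

Under this reading, $\xi_{ij} = \xi(X_i, X_j)$ and $\xi_{ij}^n$ is its estimate from $n$ samples, so the displayed condition asks that the bias of the estimate be at most $\delta^2/8$.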

Figures (1)

  • Figure 1: Estimated causal polytree for the mortgage data.

Theorems & Definitions (26)

  • Theorem 4.1
  • Theorem 7.1
  • Proposition 9.1
  • proof
  • Proposition 9.2
  • Corollary 9.3
  • proof
  • Lemma 9.4
  • proof
  • Lemma 9.5
  • ...and 16 more