Estimating large causal polytrees from small samples

Sourav Chatterjee; Mathukumalli Vidyasagar

TL;DR

This work tackles the problem of learning large causal polytrees from small samples in high-dimensional settings by presenting a fully nonparametric, two-stage approach. First, it recovers the skeleton using a pairwise $\xi$-coefficient and a maximal weighted spanning forest, with a high-probability guarantee that the estimated skeleton matches the true skeleton when $n$ scales favorably with $\log p$. Second, it recovers edge directions using a conditional dependence statistic $\tau_n$ and $\xi$-based comparisons, and the resulting DAG is provably correct with high probability under mild assumptions. The method is implemented in an R package and demonstrated on simulations and a real mortgage-subsidy dataset, illustrating robust performance even when $p$ is large relative to $n$. Overall, the paper provides a scalable, nonparametric framework for causal structure discovery in polytrees with strong theoretical guarantees and practical applicability to genomics and other high-dimensional domains.
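The skeleton step described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation. It assumes Chatterjee's rank coefficient in its no-ties form, $\xi_n = 1 - 3\sum_k |r_{k+1} - r_k| / (n^2 - 1)$, and it makes two illustrative choices that are not taken from the summary: each edge weight is the larger of the two directed $\xi_n$ values, and a hypothetical `threshold` drops weak edges so that the result can be a forest rather than a single tree. The names `xi_coefficient`, `estimate_skeleton`, and `demo_chain` are likewise hypothetical.

```python
import random


def xi_coefficient(x, y):
    """Chatterjee's rank coefficient xi_n(x, y); assumes no ties among the values."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])        # indices sorted by x
    y_by_x = [y[i] for i in order]                      # y rearranged in x-order
    rank = {v: r for r, v in enumerate(sorted(y), 1)}   # rank of each y value
    r = [rank[v] for v in y_by_x]
    jumps = sum(abs(r[k + 1] - r[k]) for k in range(n - 1))
    return 1.0 - 3.0 * jumps / (n * n - 1)


def estimate_skeleton(columns, threshold):
    """Maximum-weight spanning forest (Kruskal) over symmetrized xi weights."""
    p = len(columns)
    edges = []
    for i in range(p):
        for j in range(i + 1, p):
            # Assumption: symmetrize by taking the larger of the two directions.
            w = max(xi_coefficient(columns[i], columns[j]),
                    xi_coefficient(columns[j], columns[i]))
            if w > threshold:   # hypothetical cutoff so unrelated nodes stay unlinked
                edges.append((w, i, j))
    edges.sort(reverse=True)    # heaviest edges first

    parent = list(range(p))     # union-find structure for cycle detection

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    skeleton = set()
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # adding the edge creates no cycle
            parent[ri] = rj
            skeleton.add((i, j))
    return skeleton


def demo_chain(n=2000, seed=0):
    """Simulate X0 -> X1 -> X2 plus an independent X3 and estimate the skeleton."""
    rng = random.Random(seed)
    x0 = [rng.random() for _ in range(n)]
    x1 = [a + rng.gauss(0, 0.2) for a in x0]
    x2 = [b + rng.gauss(0, 0.2) for b in x1]
    x3 = [rng.random() for _ in range(n)]
    return estimate_skeleton([x0, x1, x2, x3], threshold=0.1)
```

In the simulated chain, the indirect pair (X0, X2) is still dependent, but its weight is smaller than both direct edges, so Kruskal's cycle check discards it. The independent variable X3 stays isolated because its weights fall below the cutoff.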

Abstract

We consider the problem of estimating a large causal polytree from a relatively small i.i.d. sample. This is motivated by the problem of determining causal structure when the number of variables is very large compared to the sample size, such as in gene regulatory networks. We give an algorithm that recovers the tree with high accuracy in such settings. The algorithm works under essentially no distributional or modeling assumptions other than some mild non-degeneracy conditions.

Paper Structure (18 sections, 14 theorems, 63 equations, 1 figure, 2 tables)

Introduction
Directed acyclic graphs
Algorithm for recovering the skeleton
Theoretical guarantee for skeleton recovery
Algorithm for recovering directionalities
R package
Theoretical guarantee for recovering directionalities
Examples
Simulations
Real data
Proof of Theorem \ref{mainresult}
A property of causal polytrees
Concentration of $\xi_n$
Data processing inequality for maximal correlation
Data processing inequality for $\xi$-correlation
...and 3 more sections

Key Result

Theorem 4.1

Let $X = (X_i)_{i\in V}$ be a finite collection of random variables with a causal polytree skeleton $T$, as defined at the beginning of the Introduction, and let $p:=|V|$. Let $T_n$ be the estimate of $T$ based on a sample of $n$ i.i.d. copies of $X$, as defined in the section "Algorithm for recovering the skeleton". For each $i$ and $j$, [...] Furthermore, suppose that $n$ is so large that $|\mathbb{E}(\xi_{ij}^n) - \xi_{ij}| \le \delta^2/8$ [...]

Figures (1)

Figure 1: Estimated causal polytree for the mortgage data.

Theorems & Definitions (26)

Theorem 4.1
Theorem 7.1
Proposition 9.1
proof
Proposition 9.2
Corollary 9.3
proof
Lemma 9.4
proof
Lemma 9.5
...and 16 more
