Efficient Ensemble Conditional Independence Test Framework for Causal Discovery

Zhengkang Guan, Kun Kuang

TL;DR

This paper introduces the Ensemble Conditional Independence Test (E-CIT), a general-purpose, plug-and-play framework that reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed.

Abstract

Constraint-based causal discovery relies on numerous conditional independence tests (CITs), but its practical applicability is severely constrained by the prohibitive computational cost, especially as CITs themselves have high time complexity with respect to the sample size. To address this key bottleneck, we introduce the Ensemble Conditional Independence Test (E-CIT), a general-purpose and plug-and-play framework. E-CIT operates on an intuitive divide-and-aggregate strategy: it partitions the data into subsets, applies a given base CIT independently to each subset, and aggregates the resulting p-values using a novel method grounded in the properties of stable distributions. This framework reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed. Moreover, our tailored p-value combination method offers theoretical consistency guarantees under mild conditions on the subtests. Experimental results demonstrate that E-CIT not only significantly reduces the computational burden of CITs and causal discovery but also achieves competitive performance. Notably, it exhibits an improvement in complex testing scenarios, particularly on real-world datasets.
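The divide-and-aggregate strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `fisher_z_cit` is a hypothetical stand-in for the base CIT, and `cauchy_combine` uses the standard Cauchy combination rule (the Cauchy distribution is stable with $\alpha = 1$) as an assumed placeholder for the paper's stable-distribution-based aggregation, whose exact form is not given here. All function and parameter names are assumptions.

```python
import math
import numpy as np

def fisher_z_cit(x, y, z):
    """Hypothetical base CIT: partial-correlation test with Fisher's z transform."""
    n = len(x)
    design = np.column_stack([np.ones(n), z])
    # Residualize x and y on the conditioning set z (with intercept).
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r = float(np.clip(np.corrcoef(rx, ry)[0, 1], -0.999999, 0.999999))
    stat = math.sqrt(n - z.shape[1] - 3) * abs(math.atanh(r))
    return math.erfc(stat / math.sqrt(2))  # two-sided normal p-value

def cauchy_combine(pvalues):
    """Assumed aggregation rule: map p-values to Cauchy variates and average them."""
    p = np.clip(np.asarray(pvalues, dtype=float), 1e-15, 1 - 1e-15)
    t = np.mean(np.tan((0.5 - p) * np.pi))
    return 0.5 - math.atan(t) / math.pi

def ensemble_cit(x, y, z, subset_size, base_cit=fisher_z_cit, seed=0):
    """Partition the samples, run the base CIT on each subset, combine the p-values."""
    n_subsets = len(x) // subset_size
    if n_subsets == 0:
        raise ValueError("subset_size exceeds the number of samples")
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))[: n_subsets * subset_size]
    blocks = np.array_split(idx, n_subsets)
    pvals = [base_cit(x[b], y[b], z[b]) for b in blocks]
    return cauchy_combine(pvals)
```

With `subset_size` held fixed, the number of base-test calls grows proportionally to the sample size while the cost of each call stays constant, which is the source of the linear complexity claimed above. Here `z` is expected to be a two-dimensional array of shape `(n, d)`.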

Paper Structure

This paper contains 29 sections, 9 theorems, 41 equations, 8 figures, 10 tables.

Key Result

Proposition 1

Let $X_1, X_2, \dots, X_J$ be independent and identically distributed (i.i.d.) random variables following a stable distribution, $X_j \sim \mathbf{S} \left(\alpha, \beta, \gamma, \delta \right)$. Then the normalized sum $S_J = \frac{1}{J} \sum_{j=1}^{J} X_j$ also follows a stable distribution: $S_J \sim \mathbf{S} \left(\alpha, \beta, \gamma^{\prime}, \delta \right)$, where $\gamma^{\prime} = J^{\frac{1}{\alpha} - 1} \gamma.$
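The scale relation in Proposition 1 can be checked numerically in a simple case. The sketch below assumes the normalized sum is the sample mean $\frac{1}{J} \sum_{j} X_j$ and uses the Cauchy distribution ($\alpha = 1$, $\beta = 0$), for which the formula gives $\gamma^{\prime} = \gamma$, so averaging does not shrink the scale. The function names are hypothetical, and the half interquartile range is used as the scale estimate because the quartiles of a centered Cauchy variable sit at $\pm \gamma$.

```python
import numpy as np

def scale_of_mean(alpha, gamma, J):
    """Scale gamma' of the mean of J i.i.d. stable variables, per Proposition 1."""
    return J ** (1.0 / alpha - 1.0) * gamma

def empirical_cauchy_scale(gamma, J, reps=100_000, seed=0):
    """Estimate the scale of the mean of J Cauchy(0, gamma) draws via the half-IQR."""
    rng = np.random.default_rng(seed)
    means = (gamma * rng.standard_cauchy((reps, J))).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return (q75 - q25) / 2.0
```

For comparison, $\alpha = 2$ (the Gaussian case) gives $\gamma^{\prime} = J^{-1/2} \gamma$, the familiar $1/\sqrt{J}$ shrinkage of a sample mean.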

Figures (8)

  • Figure 1: Overview of the E-CIT framework. Each scatter plot displays samples of variables $X$ and $Y$, with color indicating the value of $Z$. Despite smaller subset sizes, the marginal dependence (black contours) and conditional independence given $Z$ (blue contours) remain clearly distinguishable.
  • Figure 2: Comparison of Type I error (left; 0.05 significance level marked by solid black line), test power (middle), and runtime (right) for KCIT, RCIT, FastKCIT, and E-KCIT under different noise distributions.
  • Figure 3: Comparison of causal discovery performance (F1-score, SHD, and runtime) of KCIT, RCIT, and E-KCIT under different noise distributions. Shaded areas indicate $\pm 1$ standard deviation.
  • Figure 4: An example satisfying the first two conditions of Theorem 2: $p_k^{H_1} \sim \mathrm{Beta}(5, 95)$
  • Figure 5: Empirical evaluation of $\alpha$ using E-KCIT. Type I error (left) and power (right). The red line indicates the power of mean-p.
  • ...and 3 more figures

Theorems & Definitions (17)

  • Definition 1: Stable Distribution
  • Proposition 1
  • Definition 2: Ensemble Test
  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Remark 1
  • Remark 2
  • Remark 3
  • Proposition A.1: stable
  • ...and 7 more