Scalable Causal Discovery from Recursive Nonlinear Data via Truncated Basis Function Scores and Tests

Joseph Ramsey, Bryan Andrews, Peter Spirtes

TL;DR

This work introduces BF-BIC and BF-LRT, two basis-expansion tools for scalable causal discovery in nonlinear, mixed-data settings. BF-BIC provides a scalable, consistent score for additive nonlinear SEMs by embedding continuous variables in truncated Legendre bases, while BF-LRT delivers fast, asymptotically valid conditional independence testing via a likelihood-ratio framework on the same basis representations; both extend to post-nonlinear models through invertible transforms. The methods are integrated with efficient search algorithms (BOSS for score-based and PC-Max for constraint-based) to recover CPDAGs in large graphs, with extensive simulations showing favorable accuracy and runtime relative to kernel-based methods. A real-data application to wildfire risk demonstrates interpretable nonlinear causal structures and latent-variable considerations via FCIT and PAGs. The results suggest practical, scalable tools for causal discovery in complex scientific domains, with broad potential extensions to latent confounding, regime heterogeneity, and hybrid score-test strategies.
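To make the idea of a basis-function score concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Gaussian residual model, a per-variable min-max rescaling to [-1, 1], and the "higher is better" convention BIC = 2 log L - k log n; the function names `legendre_features` and `bic_local_score` and the default truncation `degree=3` are hypothetical choices made for illustration.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_features(x, degree):
    """Rescale x to [-1, 1] and evaluate Legendre polynomials P_1..P_degree."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi > lo:
        z = 2.0 * (x - lo) / (hi - lo) - 1.0
    else:
        z = np.zeros_like(x)  # constant column: nothing to expand
    # legvander returns columns P_0..P_degree; drop the constant P_0 column.
    return legendre.legvander(z, degree)[:, 1:]


def bic_local_score(y, parents, degree=3):
    """Gaussian BIC of y regressed on additive Legendre expansions of its parents.

    Assumption: score = 2 * log-likelihood - k * log(n), so higher is better.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    cols = [np.ones((n, 1))] + [legendre_features(p, degree) for p in parents]
    design = np.hstack(cols)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = max(float(resid @ resid) / n, 1e-12)  # guard against log(0)
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = design.shape[1] + 1  # regression coefficients plus noise variance
    return 2.0 * loglik - k * np.log(n)


# Toy additive nonlinear model: y = x^2 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=2000)
y = x**2 + 0.3 * rng.normal(size=2000)

score_with_parent = bic_local_score(y, [x])
score_without_parent = bic_local_score(y, [])
```

In this toy example the quadratic dependence lies in the span of the truncated basis, so the score with `x` as a parent should exceed the score of the empty parent set. A score-based search would sum such local scores over all variables in a candidate graph.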

Abstract

Learning graphical conditional independence structures from nonlinear, continuous or mixed data is a central challenge in machine learning and the sciences, and many existing methods struggle to scale to thousands of samples or hundreds of variables. We introduce two basis-expansion tools for scalable causal discovery. First, the Basis Function BIC (BF-BIC) score uses truncated additive expansions to approximate nonlinear dependencies. BF-BIC is theoretically consistent under additive models and extends to post-nonlinear (PNL) models via an invertible reparameterization. It remains robust under moderate interactions and supports mixed data through a degenerate-Gaussian embedding for discrete variables. In simulations with fully nonlinear neural causal models (NCMs), BF-BIC outperforms kernel- and constraint-based methods (e.g., KCI, RFCI) in both accuracy and runtime. Second, the Basis Function Likelihood Ratio Test (BF-LRT) provides an approximate conditional independence test that is substantially faster than kernel tests while retaining competitive accuracy. Extensive simulations and a real-data application to Canadian wildfire risk show that, when integrated into hybrid searches, BF-based methods enable interpretable and scalable causal discovery. Implementations are available in Python, R, and Java.
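The likelihood-ratio idea behind a basis-function conditional independence test can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' BF-LRT: it assumes Gaussian residuals, compares a regression of y on basis expansions of the conditioning set against one that also includes the expansion of x, and refers the statistic n log(RSS_null / RSS_full) to a chi-square distribution with degrees of freedom equal to the number of added columns. The names `lrt_ci_pvalue` and `_basis` are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy.stats import chi2


def _basis(x, degree):
    """Legendre features P_1..P_degree of x after rescaling to [-1, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    z = 2.0 * (x - x.min()) / span - 1.0 if span > 0 else np.zeros_like(x)
    return legendre.legvander(z, degree)[:, 1:]


def _rss(y, design):
    """Residual sum of squares from an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def lrt_ci_pvalue(x, y, cond, degree=3):
    """Approximate p-value for 'x independent of y given cond' (Gaussian LRT)."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    null_design = np.hstack([np.ones((n, 1))] + [_basis(c, degree) for c in cond])
    extra = _basis(x, degree)
    full_design = np.hstack([null_design, extra])
    # Twice the Gaussian log-likelihood gain from adding the x features.
    stat = n * np.log(_rss(y, null_design) / _rss(y, full_design))
    return float(chi2.sf(max(stat, 0.0), df=extra.shape[1]))


# Toy chain x -> z -> y, so x and y are dependent marginally
# but independent given z.
rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(-1.0, 1.0, size=n)
z = x**2 + 0.1 * rng.normal(size=n)
y = z + 0.1 * rng.normal(size=n)

p_marginal = lrt_ci_pvalue(x, y, [])
p_given_z = lrt_ci_pvalue(x, y, [z])
```

Here `p_marginal` should be extremely small, while `p_given_z` should behave roughly like a draw from a uniform distribution, since the true conditional dependence is absent. A constraint-based search would call such a test repeatedly with different conditioning sets.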

Paper Structure

This paper contains 62 sections, 4 theorems, 22 equations, 11 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $X = f(\text{pa}(X)) + \varepsilon_X$, where $\varepsilon_X$ is independent of $\text{pa}(X)$ and follows an exponential-family distribution. Suppose the structural function $f$ (or, in the PNL case, the transformed function $g^{-1}\!\circ f$) lies in the span of the additive basis functions use

Figures (11)

  • Figure 1: Evaluation plot for small-scale continuous simulations. Each statistic plotted is an average over 10 runs and is the point selected by maximizing the F1Adj score.
  • Figure 2: Evaluation plot for small-scale mixed simulations. Each statistic plotted is an average over 10 runs and is the point selected by maximizing the F1Adj score.
  • Figure 3: Evaluation plot for large-N continuous simulations. Each statistic plotted is an average over 10 runs and is the point selected by maximizing the F1Adj score.
  • Figure 4: Evaluation plot for large-P continuous simulations. Each statistic plotted is an average over 10 runs and is the point selected by maximizing the F1Adj score.
  • Figure 5: BOSS/BF-BIC CPDAG using measured variables only.
  • ...and 6 more figures

Theorems & Definitions (4)

  • Theorem 1: BF-BIC Consistency with Additive Structural Functions
  • Theorem 2: Additive Decomposition of BF-BIC
  • Theorem 3: Consistency of DG Score
  • Theorem 4: BF-LRT Consistency