Scalable Causal Discovery from Recursive Nonlinear Data via Truncated Basis Function Scores and Tests

Joseph Ramsey; Bryan Andrews; Peter Spirtes

TL;DR

This work introduces BF-BIC and BF-LRT, two basis-expansion tools for scalable causal discovery in nonlinear, mixed-data settings. BF-BIC is a scalable, consistent score for additive nonlinear SEMs that embeds continuous variables in truncated Legendre bases. BF-LRT is a fast, asymptotically valid conditional independence test built as a likelihood-ratio test on the same basis representations. Both extend to post-nonlinear models through invertible transforms. The methods are paired with efficient search algorithms (BOSS for score-based search, PC-Max for constraint-based search) to recover CPDAGs in large graphs, and extensive simulations show favorable accuracy and runtime relative to kernel-based methods. A real-data application to wildfire risk yields interpretable nonlinear causal structures and addresses possible latent variables via FCIT and PAGs. The results point to practical, scalable tools for causal discovery in complex scientific domains, with possible extensions to latent confounding, regime heterogeneity, and hybrid score-test strategies.

Abstract

Learning graphical conditional independence structures from nonlinear, continuous or mixed data is a central challenge in machine learning and the sciences, and many existing methods struggle to scale to thousands of samples or hundreds of variables. We introduce two basis-expansion tools for scalable causal discovery. First, the Basis Function BIC (BF-BIC) score uses truncated additive expansions to approximate nonlinear dependencies. BF-BIC is theoretically consistent under additive models and extends to post-nonlinear (PNL) models via an invertible reparameterization. It remains robust under moderate interactions and supports mixed data through a degenerate-Gaussian embedding for discrete variables. In simulations with fully nonlinear neural causal models (NCMs), BF-BIC outperforms kernel- and constraint-based methods (e.g., KCI, RFCI) in both accuracy and runtime. Second, the Basis Function Likelihood Ratio Test (BF-LRT) provides an approximate conditional independence test that is substantially faster than kernel tests while retaining competitive accuracy. Extensive simulations and a real-data application to Canadian wildfire risk show that, when integrated into hybrid searches, BF-based methods enable interpretable and scalable causal discovery. Implementations are available in Python, R, and Java.
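
To make the basis-expansion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes an additive Gaussian regression of a raw (unexpanded) response on truncated Legendre expansions of its parents, a higher-is-better BIC convention, and a likelihood-ratio statistic that compares nested regressions. The function names (`legendre_features`, `design`, `bic_score`, `lrt_statistic`) and the truncation degree of 3 are hypothetical choices made for illustration only.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_features(x, degree):
    """Rescale x to [-1, 1] and return Legendre terms P_1..P_degree as columns.

    Assumes x is not constant (otherwise the rescaling divides by zero).
    """
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    z = 2.0 * (x - lo) / (hi - lo) - 1.0
    # legvander returns columns P_0..P_degree; drop the constant P_0.
    return legendre.legvander(z, degree)[:, 1:]


def design(n, columns, degree):
    """Intercept plus an additive truncated expansion of each column."""
    blocks = [np.ones((n, 1))]
    blocks += [legendre_features(c, degree) for c in columns]
    return np.hstack(blocks)


def gaussian_loglik(y, X):
    """Maximized Gaussian log-likelihood of a least-squares fit of y on X."""
    n = y.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)


def bic_score(y, parents, degree=3):
    """Local score 2*loglik - k*log(n); higher is better (assumed convention)."""
    n = len(y)
    X = design(n, parents, degree)
    k = X.shape[1] + 1  # regression coefficients plus the noise variance
    return 2.0 * gaussian_loglik(y, X) - k * np.log(n)


def lrt_statistic(y, x, cond, degree=3):
    """Likelihood-ratio statistic for 'y independent of x given cond'.

    Returns (statistic, degrees of freedom). Under the null the statistic is
    approximately chi-square with that many degrees of freedom.
    """
    n = len(y)
    X0 = design(n, list(cond), degree)        # null model: cond only
    X1 = design(n, list(cond) + [x], degree)  # alternative: cond plus x
    stat = 2.0 * (gaussian_loglik(y, X1) - gaussian_loglik(y, X0))
    return stat, X1.shape[1] - X0.shape[1]


# Synthetic example: z -> x and z -> y, so x and y are independent given z.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
x = np.sin(z) + 0.3 * rng.normal(size=n)
y = z**2 + 0.3 * rng.normal(size=n)

print("score with parent z:     ", bic_score(y, [z]))
print("score with parents z, x: ", bic_score(y, [z, x]))
print("LRT for y vs x given z:  ", lrt_statistic(y, x, [z]))
print("LRT for y vs z given x:  ", lrt_statistic(y, z, [x]))
```

A p-value can be obtained by comparing the statistic with the chi-square reference distribution, for example with `scipy.stats.chi2.sf(stat, df)`. The sketch omits the post-nonlinear reparameterization and the degenerate-Gaussian embedding for discrete variables described in the abstract.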
