Causal Discovery for Cross-Sectional Data Based on Super-Structure and Divide-and-Conquer
Wenyu Wang, Yaping Wan
TL;DR
This work tackles the computational bottleneck of building accurate Super-Structures for cross-sectional causal discovery. It proposes a lightweight framework based on weakly constrained Super-Structures that enables divide-and-conquer without heavy domain knowledge. The framework combines a Chow–Liu maximum spanning tree (MST) scaffold built on Copula entropy, Girvan–Newman partitioning, and two-phase subgraph learning followed by Shah-style merging, which reduces conditional independence (CI) testing while maintaining competitive structural accuracy. Across synthetic benchmarks and real-world data (CHARLS), the approach achieves substantial reductions in non-redundant CI tests with only modest losses in accuracy compared to strong baselines such as PC and FCI, demonstrating practical scalability in knowledge-scarce domains. The results suggest a path toward scalable causal discovery in large biomedical and social science datasets, with future work focusing on improving the fidelity of weak scaffolds and refining the merge step.
Abstract
This paper tackles a critical bottleneck in Super-Structure-based divide-and-conquer causal discovery: the high computational cost of constructing accurate Super-Structures, particularly when conditional independence (CI) tests are expensive and domain knowledge is unavailable. We propose a novel, lightweight framework that relaxes the strict requirements on Super-Structure construction while preserving the algorithmic benefits of divide-and-conquer. By integrating weakly constrained Super-Structures with efficient graph partitioning and merging strategies, our approach substantially lowers CI test overhead with little loss of accuracy. We instantiate the framework in a concrete causal discovery algorithm and rigorously evaluate its components on synthetic data. Comprehensive experiments on Gaussian Bayesian networks, including magic-NIAB, ECOLI70, and magic-IRRI, demonstrate that our method matches or closely approximates the structural accuracy of PC and FCI while drastically reducing the number of CI tests. Further validation on the real-world China Health and Retirement Longitudinal Study (CHARLS) dataset confirms its practical applicability. Our results establish that accurate, scalable causal discovery is achievable even under minimal assumptions about the initial Super-Structure, opening new avenues for applying divide-and-conquer methods to large-scale, knowledge-scarce domains such as biomedical and social science research.
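To make the scaffold-and-partition idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It builds a Chow–Liu maximum spanning tree from pairwise dependence scores and then performs a single Girvan–Newman-style cut. The function names (`normal_scores`, `gaussian_copula_mi`, `chow_liu_tree`, `split_tree`) are hypothetical. The dependence score assumes a Gaussian copula, under which mutual information (the negative of Copula entropy) reduces to -0.5 log(1 - rho^2) on rank-based normal scores; the paper's estimator may differ. The sketch also assumes the weak Super-Structure is the tree itself, in which case the betweenness of an edge equals the product of the sizes of the two components it separates.

```python
import math
import random
from statistics import NormalDist


def normal_scores(column):
    """Rank-transform a column to standard-normal scores (empirical copula)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def gaussian_copula_mi(x, y):
    """Mutual information assuming a Gaussian copula (an assumption of this sketch)."""
    zx, zy = normal_scores(x), normal_scores(y)
    sxy = sum(a * b for a, b in zip(zx, zy))
    sxx = sum(a * a for a in zx)
    syy = sum(b * b for b in zy)
    rho = sxy / math.sqrt(sxx * syy)
    rho = max(min(rho, 0.999999), -0.999999)  # keep the log finite
    return -0.5 * math.log(1.0 - rho * rho)


def chow_liu_tree(data):
    """data: list of columns. Returns the edges of the maximum-MI spanning tree (Kruskal)."""
    d = len(data)
    weighted = sorted(
        ((gaussian_copula_mi(data[i], data[j]), i, j)
         for i in range(d) for j in range(i + 1, d)),
        reverse=True,
    )
    parent = list(range(d))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges


def component(start, edges, removed):
    """Nodes reachable from start once the edge `removed` is deleted."""
    adj = {}
    for u, v in edges:
        if (u, v) != removed:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def split_tree(edges, d):
    """One Girvan-Newman step on a tree: cut the edge with the highest betweenness."""
    def betweenness(e):
        s = len(component(e[0], edges, e))
        return s * (d - s)  # node pairs whose unique path crosses e

    cut = max(edges, key=betweenness)
    side = component(cut[0], edges, cut)
    return sorted(side), sorted(set(range(d)) - side)


# Toy usage: a linear-Gaussian chain X0 -> X1 -> ... -> X5 (synthetic, illustrative only).
rng = random.Random(0)
n, d = 2000, 6
data = [[rng.gauss(0, 1) for _ in range(n)]]
for _ in range(d - 1):
    data.append([0.8 * v + rng.gauss(0, 1) for v in data[-1]])
tree = chow_liu_tree(data)
blocks = split_tree(tree, d)
```

Each resulting block would then be passed to a local learner (for example PC restricted to that block), so CI tests are only run among variables in the same block. The merge step and the second learning phase are omitted here.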
