AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs

Victor Akinwande; J. Zico Kolter

TL;DR

This paper tackles the scalability gap in causal discovery by GPU-accelerating LiNGAM methods, preserving identifiability guarantees while dramatically speeding up the causal-ordering step. It implements and analyzes efficient DirectLiNGAM and VarLiNGAM kernels on GPUs, achieving up to ~32x and ~30x speed-ups respectively without altering the underlying algorithms. Through experiments on large-scale gene expression data with genetic interventions ($d \approx 964$) and stock market data ($d=487$), the approach demonstrates competitive performance against continuous-optimization baselines and enables practical application to real-world, high-dimensional datasets. The work also provides detailed CUDA implementation guidance and discusses future improvements in I/O awareness, suggesting broad potential impact for LiNGAM-based causal discovery in domains requiring both speed and identifiability.

Abstract

Existing causal discovery methods based on combinatorial optimization or search are slow, prohibiting their application on large-scale datasets. In response, more recent methods attempt to address this limitation by formulating causal discovery as structure learning with continuous optimization, but such approaches thus far provide no statistical guarantees. In this paper, we show that by efficiently parallelizing existing causal discovery methods, we can in fact scale them to thousands of dimensions, making them practical for substantially larger-scale problems. In particular, we parallelize the LiNGAM method, which is quadratic in the number of variables, obtaining up to a 32-fold speed-up on benchmark datasets when compared with existing sequential implementations. Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it. This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions, yielding competitive results compared with specialized continuous optimization methods, and to apply VarLiNGAM to causal discovery on U.S. stock data.

Paper Structure (15 sections, 3 equations, 4 figures, 2 tables, 1 algorithm)

Introduction
Background
Causal discovery based on Functional Causal Models (FCM)
Standard LiNGAM implementation
GPU execution model
Continuous optimization based structure learning
AcceleratedLiNGAM: Analysis, and Extensions
Efficient DirectLiNGAM implementation
Extension: Efficient VarLiNGAM implementation
Low level CUDA implementation details
Experiments
AcceleratedLiNGAM to gene expression data with genetic interventions
AcceleratedLiNGAM to stock data with auto-regression modelling
Conclusion
Acknowledgments

Figures (4)

Figure 1: Illustration of the causal asymmetry principle underpinning LiNGAM. Given data generated according to a LiNGAM functional causal model as in Eqn. \ref{eq:fcm}, the regression residual can be independent of the regressor only in the correct causal direction (top figure). This holds for any noise distribution except the Gaussian. Independence is measured using mutual information (MI).
Figure 2: Benchmark of CPU (sequential) implementation of DirectLiNGAM. Given data with specified number of samples and dimensions, the causal ordering sub-procedure accounts for up to 96% of overall runtime (top-left). It takes 7 hours on a CPU to process a dataset of 1 million samples with 100 variables (top-right). Benchmark of GPU (parallel) implementation of DirectLiNGAM (bottom-left) and VarLiNGAM (bottom-right). Given data with specified number of samples and dimensions, the parallel implementation achieves up to a 32-fold speed-up when compared to the sequential implementation. The benchmark is obtained using an NVIDIA RTX 6000 Ada with 18,176 cores.
Figure 3: Comparison of parallel and sequential implementation of DirectLiNGAM. We simulate data according to a linear FCM with 10,000 samples and 10 variables. Both implementations produce the exact same result (top). Benchmark of CPU (sequential) implementation of VarLiNGAM. Given data with specified number of samples and dimensions, the causal ordering sub-procedure of DirectLiNGAM accounts for up to 96% of overall runtime (bottom-left). It takes 7 hours on a CPU to process a dataset of 1 million samples with 100 variables (bottom-right).
Figure 4: In-degree and out-degree distributions of the adjacency matrix obtained using VarLiNGAM on S&P 500 hourly data. We observe some level of symmetry between in-degree and out-degree, and the fairly uniform distribution across a range of degrees indicates that there are no prominent hubs that significantly stand out.
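
The causal asymmetry described in the Figure 1 caption can be illustrated with a small simulation. The sketch below is not the paper's implementation: it assumes a two-variable linear model with uniform (non-Gaussian) noise, and it swaps in a simple histogram-based mutual information estimate in place of whatever MI estimator the authors use. The function names `residual` and `histogram_mi`, the coefficient, the sample size, and the bin count are all illustrative choices.

```python
import numpy as np


def residual(target, regressor):
    """Residual of the least-squares regression of target on regressor."""
    t = target - target.mean()
    r = regressor - regressor.mean()
    slope = np.dot(t, r) / np.dot(r, r)
    return t - slope * r


def histogram_mi(a, b, bins=16):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram.

    This is a crude stand-in estimator chosen for brevity, not the
    estimator used in DirectLiNGAM.
    """
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    mask = p_ab > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))


# Simulate x -> y with non-Gaussian (uniform) noise; all values are assumptions.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 * x + rng.uniform(-1.0, 1.0, n)

# Correct direction: regress y on x, compare the residual with the regressor x.
mi_forward = histogram_mi(x, residual(y, x))
# Wrong direction: regress x on y, compare the residual with the regressor y.
mi_backward = histogram_mi(y, residual(x, y))

print(f"MI(x, residual of y on x) = {mi_forward:.4f}")
print(f"MI(y, residual of x on y) = {mi_backward:.4f}")
```

Under these assumptions the estimate in the correct direction should sit near zero (up to the small positive bias of the histogram estimator), while the estimate in the reverse direction should be clearly larger. Replacing the uniform noise with Gaussian noise should remove the gap, which matches the caption's exception for Gaussian noise.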
