Optimal Kernel Choice for Score Function-based Causal Discovery

Wenjie Wang; Biwei Huang; Feng Liu; Xinge You; Tongliang Liu; Kun Zhang; Mingming Gong

TL;DR

This work tackles the kernel-parameter selection problem in RKHS-based score functions for causal discovery. It introduces a mutual-information-based objective that models the variables involved in each search step as a mixture of independent noise variables and places a Gaussian process prior on the nonlinear mapping. Kernel parameters are then learned automatically by maximizing the joint marginal likelihood $p(X, PA)$. The authors prove local consistency of the resulting score and show, on synthetic and real benchmarks, that automatic kernel learning outperforms median-heuristic kernel choices and prior RKHS-based scores, particularly in dense graphs. The approach yields more accurate causal graphs from observational data and reduces reliance on manual, heuristic kernel tuning.

Abstract

Score-based methods have demonstrated their effectiveness in discovering causal relationships by scoring different causal structures based on their goodness of fit to the data. Recently, Huang et al. proposed a generalized score function that can handle general data distributions and causal relationships by modeling the relations in reproducing kernel Hilbert space (RKHS). The selection of an appropriate kernel within this score function is crucial for accurately characterizing causal relationships and ensuring precise causal discovery. However, the current method involves manual heuristic selection of kernel parameters, making the process tedious and less likely to ensure optimality. In this paper, we propose a kernel selection method within the generalized score function that automatically selects the optimal kernel that best fits the data. Specifically, we model the generative process of the variables involved in each step of the causal graph search procedure as a mixture of independent noise variables. Based on this model, we derive an automatic kernel selection method by maximizing the marginal likelihood of the variables involved in each search step. We conduct experiments on both synthetic data and real-world benchmarks, and the results demonstrate that our proposed method outperforms heuristic kernel selection methods.
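To make the contrast between heuristic and likelihood-based kernel selection concrete, the sketch below compares a median-heuristic bandwidth with one chosen by maximizing a marginal likelihood. It is a simplified illustration, not the paper's method: it uses the standard Gaussian process regression marginal likelihood $p(X \mid PA)$ rather than the joint objective $p(X, PA)$ described above, it searches a grid instead of using gradients, and the function names, the fixed `noise_var`, and the toy data are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between 1-D sample arrays a and b."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def median_heuristic(a):
    """Bandwidth set to the median pairwise distance between samples."""
    dists = np.abs(a[:, None] - a[None, :])
    return np.median(dists[np.triu_indices(len(a), k=1)])

def gp_log_marginal_likelihood(x, pa, bandwidth, noise_var=0.1):
    """log p(x | pa) for a zero-mean GP with an RBF kernel on pa.

    Simplification: this is the conditional GP regression likelihood,
    not the joint likelihood p(X, PA) used in the paper.
    """
    n = len(x)
    cov = rbf_kernel(pa, pa, bandwidth) + noise_var * np.eye(n)
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, x))
    return (-0.5 * x @ alpha
            - np.sum(np.log(np.diag(chol)))
            - 0.5 * n * np.log(2.0 * np.pi))

def select_bandwidth(x, pa, candidates):
    """Return the candidate bandwidth with the highest log marginal likelihood."""
    scores = [gp_log_marginal_likelihood(x, pa, b) for b in candidates]
    return candidates[int(np.argmax(scores))], scores

# Toy data (hypothetical): one parent, nonlinear effect, additive noise.
rng = np.random.default_rng(0)
pa = rng.uniform(-2.0, 2.0, size=100)
x = np.sin(3.0 * pa) + 0.3 * rng.normal(size=100)

# Include the median-heuristic value as the last candidate for comparison.
candidates = np.append(np.logspace(-2, 1, 20), median_heuristic(pa))
best, scores = select_bandwidth(x, pa, candidates)
print("median heuristic:", candidates[-1], "log-lik:", scores[-1])
print("selected:", best, "log-lik:", max(scores))
```

Because the median-heuristic bandwidth is one of the candidates, the selected bandwidth can never score worse than it under this criterion; whether that translates into better graph recovery is what the paper's experiments examine.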

Paper Structure (29 sections, 2 theorems, 37 equations, 8 figures)

Introduction
Background
Conditional Cross-covariance Operator on RKHS
Regression in RKHS
Motivation
Optimal Kernel Selection via Minimizing Mutual Information
Preliminaries
Mutual information-based score function
Search Procedure
Comparison with existing score functions
Experimental Results
Synthetic Data
Real Benchmark Datasets
Computation Analysis
Conclusion
...and 14 more sections

Key Result

Lemma 2.1

(fukumizu2004dimensionality) Let $(\mathcal{H}_\mathcal{X}, k_\mathcal{X})$, $(\mathcal{H}_\mathcal{Y}, k_\mathcal{Y})$ and $(\mathcal{H}_\mathcal{Z}, k_\mathcal{Z})$ be reproducing kernel Hilbert spaces over measurable spaces $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, with continuous and bounded kernels. Let $X$, $Y$ and $Z$ be random va... And if further $k_\mathcal{X}$ is a characteristic kernel (fukumizu2007kernel), the following equation ...

Figures (8)

Figure 1: Visualization of features and estimated noise using the median heuristic or a learnable bandwidth. (a) Scatter plot of original data $Y$ and $X$. (b) Scatter plot of the projected features $f(Y)$ and $\bm{k}_x$ using the conditional likelihood-based score with trainable $k_\mathcal{X}$. (c) Scatter plot of the estimated regression noise $\tilde{\varepsilon}_{X|Y}$ and $Z$ with the median heuristic bandwidth. HSIC is used to quantify the independence between $\tilde{\varepsilon}_{X|Y}$ and $Z$; a lower HSIC value indicates a higher degree of independence. (d) Scatter plot of $\tilde{\varepsilon}_{X|Y}$ and $Z$ with trainable $k_\mathcal{X}$ using our proposed score function (Ours).
Figure 2: The F1 score of recovered causal graphs on: (a.1) continuous data with sample size $n = 200$ and (a.2) $n=500$; (b.1) mixed data with $n = 200$ and (b.2) $n=500$; and (c.1) multi-dimensional data with $n = 200$ and (c.2) $n=500$. The x-axis represents the graph density and the y-axis is the F1 score; higher F1 scores indicate higher accuracy. Shaded regions show standard errors of the mean.
Figure 3: The normalized SHD of recovered causal graphs on synthetic data with different data types and sample sizes. The y-axis is the normalized SHD; a lower value means better accuracy.
Figure 4: Results on benchmarks (a) SACHS and (b) CHILD with different sample sizes. A higher F1 score or a lower SHD score indicates better performance. Comparison between Marg and our method on (c) convergence time for a single edge and (d) the overall search time for the entire graph.
Figure 5: Results on the synthetic discrete dataset. All variables in the graph are discrete, with values from either $[1, 5]$ or $[1, 20]$. The F1 score of recovered causal graphs with sample size (a.1) $n = 200$, (a.2) $n=500$ and (a.3) $n=1000$, where a higher F1 score $\uparrow$ indicates greater accuracy. The normalized SHD score $\downarrow$ for the same sample sizes is shown in (b.1) $n = 200$, (b.2) $n=500$ and (b.3) $n=1000$, with a lower SHD score signifying better accuracy. The x-axis is the graph density. Shaded regions show standard errors of the mean.
...and 3 more figures
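
The Figure 1 caption uses HSIC as a dependence measure between the estimated noise and $Z$. As a rough illustration of what such a number looks like, the sketch below computes the biased empirical HSIC with Gaussian kernels. The bandwidths, sample sizes, and toy variables are assumptions for the example (not the settings used in the paper), and normalization conventions for the estimator vary between references.

```python
import numpy as np

def gram_rbf(v, bandwidth):
    """Gaussian (RBF) Gram matrix of a 1-D sample array."""
    sq = (v[:, None] - v[None, :]) ** 2
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def hsic_biased(u, v, bw_u=1.0, bw_v=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    n = len(u)
    h = np.eye(n) - np.ones((n, n)) / n
    k = gram_rbf(u, bw_u)
    l = gram_rbf(v, bw_v)
    return np.trace(k @ h @ l @ h) / n ** 2

# Toy check (hypothetical data): a dependent pair versus an independent pair.
rng = np.random.default_rng(0)
u = rng.normal(size=300)
v_dependent = u + 0.1 * rng.normal(size=300)
v_independent = rng.normal(size=300)

hsic_dep = hsic_biased(u, v_dependent)
hsic_ind = hsic_biased(u, v_independent)
print("dependent pair:", hsic_dep)
print("independent pair:", hsic_ind)
```

Consistent with the caption, the independent pair yields the lower value; the statistic is close to zero rather than exactly zero on finite samples.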

Theorems & Definitions (3)

Lemma 2.1
Definition 4.1
Lemma 4.2
