KL-BSS: Rethinking optimality for neighbourhood selection in structural equation models

Ming Gao; Wai Ming Tai; Bryon Aragam

TL;DR

This work tackles neighbourhood selection in linear SEMs, where dependence among covariates challenges standard support-recovery methods like BSS and the Lasso. It introduces KL-BSS, a KL-divergence-inspired estimator that augments the BSS framework with a beta-min constrained score, enabling it to exploit unknown SEM structure. The authors establish pointwise and minimax sample complexities via eigenvalues $\lambda_K(\Sigma)$ and $\lambda_B(\Sigma)$, showing KL-BSS achieves strictly better performance on designs in $\Omega_\Delta$ and is minimax-optimal over a broad class. They also provide practical MIP implementations and extensions to unknown sparsity and $\beta_{\min}$, with extensive simulations and a pan-cancer data application demonstrating improvements in both recovery and downstream prediction. Overall, KL-BSS advances neighbourhood selection in SEMs by leveraging latent structure even when unknown, with broad implications for causal structure learning.

Abstract

We introduce a new method for neighbourhood selection in linear structural equation models that improves over classical methods such as best subset selection (BSS) and the Lasso. Our method, called KL-BSS, takes advantage of the existence of underlying structure in SEM -- even when this structure is unknown -- and is easily implemented using existing solvers. Under weaker eigenvalue conditions compared to BSS and the Lasso, KL-BSS can provably recover the support of linear models with fewer samples. We establish both the pointwise and minimax sample complexity for recovery, which KL-BSS obtains. Extensive experiments on both real and simulated data confirm the improvements offered by KL-BSS. While it is well-known that the Lasso encounters difficulties under structured dependencies, it is less well-known that even BSS runs into trouble as well, and can be substantially improved. These results have implications for structure learning in graphical models, which often relies on neighbourhood selection as a subroutine.
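For readers unfamiliar with the baseline, the following is a minimal sketch of classical best subset selection (BSS) by exhaustive search, the method the abstract compares against. It is not KL-BSS and not the authors' MIP implementation; the function name `best_subset_selection`, the synthetic data, and the assumption that the sparsity level `s` is known are all illustrative choices.

```python
import itertools
import numpy as np

def best_subset_selection(X, y, s):
    """Classical BSS: return the size-s support with the smallest
    residual sum of squares, found by brute-force enumeration."""
    n, d = X.shape
    best_support, best_rss = None, np.inf
    for support in itertools.combinations(range(d), s):
        cols = list(support)
        # Ordinary least squares restricted to the candidate support.
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        resid = y - X[:, cols] @ coef
        rss = float(resid @ resid)
        if rss < best_rss:
            best_support, best_rss = support, rss
    return best_support, best_rss

# Hypothetical toy data: independent covariates, true support {1, 4}.
rng = np.random.default_rng(0)
n, d = 500, 6
X = rng.standard_normal((n, d))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.1 * rng.standard_normal(n)

support, rss = best_subset_selection(X, y, s=2)
print(support)
```

The toy design here has independent covariates, which is the easy case for BSS. The abstract's point is that structured dependence among covariates, as in an SEM, is where BSS and the Lasso run into trouble.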

Paper Structure (74 sections, 27 theorems, 213 equations, 15 figures, 1 table, 6 algorithms)

This paper contains 74 sections, 27 theorems, 213 equations, 15 figures, 1 table, 6 algorithms.

Introduction
Overview
Contributions
Related work
Outline of the paper
Notation
Preliminaries
Graphical models
Neighbourhood selection
Problem setup
KL-BSS: Support recovery in SEM
Comparing two candidates
The proposed estimator
Comparison with BSS
Analysis of KL-BSS
...and 59 more sections

Key Result

Proposition 2.1

If $(Z_k,Z_A)\sim \mathcal{N}(\mathbf{0},\Gamma)$ and $\Gamma$ is a positive definite covariance matrix, then for any subset $S\subseteq A$, the following are equivalent: ... Moreover, suppose $P(Z)$ is an SEM by \eqref{eq:pre:sem} with $b_{jk}\ne 0$ for all $j\in\operatorname{pa}(k)$. If $A=\operatorname{nd}(k)$, then $S(k;A)=\operatorname{pa}(k)=\operatorname{supp}$...
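The second half of the proposition can be checked numerically on a small example. The statement is cut off in the source, so the sketch below assumes it identifies $\operatorname{pa}(k)$ with the support of the population regression coefficients of $Z_k$ on $Z_A$ when $A=\operatorname{nd}(k)$. The five-node DAG, edge weights, noise variances, and the helper `population_coef` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical linear SEM on 5 nodes: Z_k = sum_j b_jk Z_j + eps_k.
# Edges: 0->1, 0->2, 1->3, 2->3, 3->4.  B[j, k] holds b_jk.
d = 5
B = np.zeros((d, d))
B[0, 1], B[0, 2] = 0.8, -0.5
B[1, 3], B[2, 3] = 1.2, 0.7
B[3, 4] = -0.9
omega = np.diag([1.0, 0.5, 2.0, 1.0, 0.3])  # noise variances (assumed)

# Z = B^T Z + eps  implies  Z = (I - B^T)^{-1} eps, so
# Cov(Z) = (I - B^T)^{-1} Omega (I - B^T)^{-T}.
M = np.linalg.inv(np.eye(d) - B.T)
Sigma = M @ omega @ M.T

def population_coef(Sigma, k, A):
    """Population regression coefficients of Z_k on Z_A."""
    A = list(A)
    return np.linalg.solve(Sigma[np.ix_(A, A)], Sigma[A, k])

k = 3
tol = 1e-8

# Regress on the non-descendants of node 3: support equals pa(3) = {1, 2}.
nd = [0, 1, 2]
beta_nd = population_coef(Sigma, k, nd)
support_nd = [j for j, b in zip(nd, beta_nd) if abs(b) > tol]

# Regress on all other nodes: the child 4 also enters the support.
rest = [0, 1, 2, 4]
beta_rest = population_coef(Sigma, k, rest)
support_rest = [j for j, b in zip(rest, beta_rest) if abs(b) > tol]

print(support_nd, support_rest)
```

In this toy model, restricting the candidate set to non-descendants recovers exactly the parents with coefficients equal to the edge weights, while regressing on all remaining nodes also picks up the child of the target.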

Figures (15)

Figure 1: Overview of SEM and improvement of KL-BSS. (Left) An example SEM over $d=11$ nodes. The target variable is $Y$, the neighbourhood of $Y$ is $S_1=\{X_1,X_2\}$, and the remaining nodes $X=(X_3,\ldots,X_{10})$ are shaded. The (partial) regression coefficients $\mathbb{E}\widehat{\beta}(S)=\mathbb{E}(X_S^\top X_S)^{-1}X_S^\top Y$ are computed for the support candidates $S=S_j$ ($j=1,2,\ldots,5$). For simplicity, we only present a subset of all possible supports. (Right) KL-BSS strictly improves over BSS in support recovery: An illustration of the improvement for both sparse and dense graphs, summarized from the results in Section \ref{sec:expt:simu}.
Figure 2: A graphical model over $Z=(Z_1,\ldots,Z_d)$ with one more target node $Z_{d+1}$ appended to it. The corresponding $G$ will refer to a DAG over $Z$ (ignoring $Z_{d+1}$). The Markov boundary of $Z_{d+1}$ is $\{Z_2,Z_{d-2},Z_{d-1}\}$ under this model.
Figure 3: (Left) The DAG of the SEM in Example \ref{ex:pathcancel} with $d=2s$ nodes. The true parents (support) of $Y$ are $S_{*}=\{X_1,X_2,\ldots,X_{s-1},X_{s}\}$ and the remaining nodes are shaded. The edges from $S_{*}$ to $Y$ are in bold. (Right) Recovery performance of KL-BSS and BSS in terms of parameters $k$ and $b$: Shaded regions indicate parameters for which each method achieves a fixed recovery probability. KL-BSS is independent of $b$ while the performance of BSS quickly degrades as $b^2$ increases. The unshaded region on the right indicates the parameter tuples $(k,b)$ for which neither method achieves the same recovery probability.
Figure 4: Comparison of the support recovery performance of BSS, KL-BSS and Lasso under different types of graphs and dimensions $(d,s,\overline{s})$, averaged over $200$ replications. The horizontal axis is sample size, the vertical axis is probability of exact recovery. The first, middle, and last two columns correspond to ER, SF, and complete graphs, respectively. There is a notable performance gap between KL-BSS and BSS. Lasso is never consistent.
Figure 5: Left panel: Effect of unknown sparsity on recovery performance. KL-BSS and BSS with various specifications of $\overline{s}$, indicated by the opacity. The performance of each method is robust to the given sparsity upper bound (lines overlap due to similar performance). Middle panel: Cross-validation for the choice of $\beta_{\min}$. The solid lines plot KL-BSS with correct $\beta_{\min}$. The dashed lines plot the CV performance. The thinner lines plot KL-BSS with each candidate value of $\beta_{\min}$, ranging from red to green to cyan. The CV estimate is slightly inferior to KL-BSS with correct $\beta_{\min}$, but still performs better than BSS. Right panel: Time complexity in log-log plot (dark blue/red solid lines) and recovery performance (light blue/red dashed lines) of KL-BSS/BSS using MIP. KL-BSS has the same order of computational cost as BSS, incurring a small overhead while achieving better recovery performance.
...and 10 more figures

Theorems & Definitions (67)

Remark 1.1
Example 1
Proposition 2.1
Remark 2.1
Remark 2.2
Remark 3.1
Definition 1
Lemma 4.1
Theorem 4.2
Theorem 4.3
...and 57 more
