Improving variable selection properties by leveraging external data

Paul Rognon-Vael, David Rossell, Piotr Zwiernik

TL;DR

This work shows how external information that partitions parameters into blocks can relax the stringent sparsity and signal-strength requirements in high-dimensional variable selection. By introducing block-specific, non-exchangeable $\ell_0$ penalties, the authors demonstrate oracle and empirical Bayes procedures that achieve model selection consistency under milder conditions and faster convergence rates than standard penalties. The analysis spans the Gaussian sequence model and high-dimensional linear regression under arbitrary design, with rigorous sufficient and necessary conditions, as well as data-driven strategies that estimate block sparsity and adapt penalties accordingly. The results provide a theoretical foundation for data integration and transfer-learning approaches in structural learning, while offering practical, computation-friendly procedures (e.g., MCMC-based schemes) for scalable inference in complex models.

Abstract

Sparse high-dimensional signal recovery is only possible under certain conditions on the number of parameters, sample size, signal strength and underlying sparsity. We show that leveraging external information, as is possible with data integration or transfer learning, makes it possible to push these mathematical limits. Specifically, we consider external information that allows splitting parameters into blocks, first in a simplified case, the Gaussian sequence model, and then in the general linear regression setting. We show that block-based $\ell_0$ penalties that depend on the external information attain model selection consistency under milder conditions than standard $\ell_0$ penalties, and that they also attain faster model recovery rates. We first provide results for oracle-based $\ell_0$ penalties that have access to perfect sparsity and signal strength information. Subsequently, we propose an empirical Bayes data analysis method that does not require oracle information and for which efficient computation is possible via standard MCMC techniques. Our results provide a mathematical basis to justify the use of data integration methods in high-dimensional structural learning.
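
The idea of a block-based $\ell_0$ penalty can be illustrated with a minimal sketch in the Gaussian sequence model. It assumes observations $y_i = \theta_i + \varepsilon_i$ with unit-variance noise and a criterion $\sum_i (y_i - \theta_i)^2 + \sum_j \kappa_j |S \cap B_j|$, which separates by coordinate so that index $i$ in block $B_j$ is selected exactly when $y_i^2 > \kappa_j$. The function name `block_l0_select`, the two-block partition, and the $2\log(\cdot)$ penalty values are hypothetical choices made for illustration; they are not the oracle or empirical Bayes penalties derived in the paper.

```python
import numpy as np

def block_l0_select(y, blocks, kappas):
    """Support minimising sum_i (y_i - theta_i)^2 + sum_j kappas[j] * |S ∩ B_j|.

    The criterion separates by coordinate, so index i (in block blocks[i])
    is kept exactly when y_i^2 exceeds that block's penalty.
    """
    y = np.asarray(y, dtype=float)
    blocks = np.asarray(blocks)
    kappas = np.asarray(kappas, dtype=float)
    return np.flatnonzero(y**2 > kappas[blocks])

# Hypothetical setting: external data says block 0 (50 parameters) is rich in
# signals, while block 1 (950 parameters) contains none.
rng = np.random.default_rng(0)
p = 1000
blocks = np.repeat([0, 1], [50, 950])
theta = np.zeros(p)
theta[:25] = 3.0
y = theta + rng.standard_normal(p)

# Illustrative penalties (an assumption, not the paper's choice):
# one penalty per block, based on block size, versus a single common penalty.
kappa_block = np.array([2 * np.log(50), 2 * np.log(950)])
kappa_single = np.array([2 * np.log(p), 2 * np.log(p)])

S_block = block_l0_select(y, blocks, kappa_block)
S_single = block_l0_select(y, blocks, kappa_single)
print("true signals found, block penalties:", np.sum(S_block < 25))
print("true signals found, single penalty: ", np.sum(S_single < 25))
```

Because each block penalty here is smaller than the common one, every index selected under the single penalty is also selected under the block penalties; whether the extra selections are true signals or false positives depends on how informative the partition is.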

Paper Structure

This paper contains 65 sections, 36 theorems, 348 equations, 3 figures, 3 tables.

Key Result

Proposition 3.1

In the sequence model eq:GSM, let $\hat{S}^b$ and $\kappa_1,\ldots,\kappa_b$ be as defined in eq:Mhat. Then $\hat{S}^b=\hat{S}^b_1\cup \cdots \cup \hat{S}^b_b$, where, for each $j=1,\ldots,b$,

Figures (3)

  • Figure 1: Smallest (dashed) and largest (solid) value of $\tau$ leading to consistent model recovery in Examples 1 to 4, as given in \ref{eq:rangethreshex} and \ref{eq:rangeblockthreshex}. Red indicates settings where the interval is empty.
  • Figure 2: Ratio of smallest signals recoverable (left) and oracle convergence rates (right) with $\hat{S}^b$ and with $\hat{S}$ in Examples 1--4
  • Figure 3: Probability of correct selection with $\hat{S}^{EB,b}$ (solid black), $\hat{S}^{A,b}$ (dashed black), $\hat{S}^{EB}$ (solid grey), $\hat{S}^{A}$ (dashed grey), and the EBIC penalty (dotted grey) in Examples 1 (top left), 2 (top right), 4 (bottom left) and 5 (bottom right)

Theorems & Definitions (47)

  • Proposition 3.1
  • Proposition 3.2
  • Lemma 3.3
  • Theorem 3.4
  • Corollary 3.5
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Lemma 4.4
  • Theorem 4.5
  • ...and 37 more