Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection

Xiaochen Zhang, Yunfeng Cai, Haoyi Xiong

TL;DR

Knoop addresses the challenge of variable selection in high-dimensional, highly correlated data by extending Knockoff filters through recursive, multi-layer knockoffs and over-parameterization with Ridgeless regression. It builds an expanded coefficient space, tests each original coefficient as an anomaly against the distribution of its knockoff coefficients to produce $p$-values, and applies false discovery rate (FDR) control via the Benjamini-Hochberg (BH) procedure to select variables. Empirical results show higher AUC than existing methods in simulations and strong predictive performance on real-world datasets, across both regression and classification tasks. The approach offers a scalable, statistically principled mechanism for identifying truly relevant variables while mitigating issues from multicollinearity and model capacity limits.
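The final selection step mentioned above, FDR control via BH, can be illustrated with a minimal sketch of the standard Benjamini-Hochberg step-up rule. The function name `benjamini_hochberg`, the level `q`, and the example $p$-values are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.1):
    """Boolean mask of hypotheses rejected by the BH step-up rule at level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = q * np.arange(1, m + 1) / m   # BH threshold for each rank
    below = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank meeting its threshold
        selected[order[:k + 1]] = True         # reject everything up to that rank
    return selected

# Illustrative p-values only (not results from the paper).
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
mask = benjamini_hochberg(example_p, q=0.05)
```

In a variable-selection setting, the variables whose mask entry is `True` would be the ones retained.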

Abstract

Variable selection plays a crucial role in enhancing modeling effectiveness across diverse fields, addressing the challenges posed by high-dimensional datasets of correlated variables. This work introduces a novel approach, Knockoff with over-parameterization (Knoop), to enhance Knockoff filters for variable selection. Specifically, Knoop first generates multiple knockoff variables for each original variable and integrates them with the original variables into an over-parameterized Ridgeless regression model. For each original variable, Knoop evaluates the coefficient distribution of its knockoffs and compares it with the original variable's coefficient to conduct an anomaly-based significance test, ensuring robust variable selection. Extensive experiments demonstrate superior performance compared to existing methods on both simulated and real-world datasets. In controlled simulations, Knoop achieves a notably higher Area under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve for identifying relevant variables against the ground truth, and it shows enhanced predictive accuracy across diverse regression and classification tasks. The analytical results further back up these observations.
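The pipeline described in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the "knockoffs" here are plain column-wise permutations used as a placeholder for the paper's recursive multi-layer construction, the empirical $p$-value formula is one common choice rather than the paper's exact test, and all sizes and names (`n_copies`, `X_aug`, `coef_knock`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: only the first three variables are relevant.
n, p, n_copies = 50, 10, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0
y = X @ beta + 0.5 * rng.standard_normal(n)

# Placeholder knockoffs: each copy permutes every column independently.
# This stands in for the paper's recursive knockoff generation.
copies = [rng.permuted(X, axis=0) for _ in range(n_copies)]
X_aug = np.hstack([X] + copies)       # shape (n, p * (1 + n_copies)), wider than n

# Ridgeless regression: the minimum-norm least-squares solution,
# obtained here through the Moore-Penrose pseudoinverse.
coef = np.linalg.pinv(X_aug) @ y
coef_orig = coef[:p]
coef_knock = coef[p:].reshape(n_copies, p)   # row c holds the coefficients of copy c

# Empirical one-sided p-value per variable: how often a knockoff coefficient
# is at least as large in magnitude as the original coefficient.
exceed = (np.abs(coef_knock) >= np.abs(coef_orig)).sum(axis=0)
p_values = (1 + exceed) / (1 + n_copies)
```

Because the augmented design has more columns than rows, the fitted model interpolates `y`, which is the over-parameterized regime the abstract refers to. The resulting `p_values` could then be passed to an FDR procedure to obtain the selected set.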

Paper Structure

This paper contains 27 sections, 2 theorems, 16 equations, 1 figure, 4 tables, 4 algorithms.

Key Result

Proposition 1

Given the input matrix $X=[X_{1}, \ldots, X_{p}]$, Algorithm alg:hierKnock outputs a matrix $\widetilde{X}$ with the structure $\widetilde{X}=[X_{1}, \ldots, X_{p}, \widetilde{\mathbf{K}}_{1,1}, \ldots, \widetilde{\mathbf{K}}_{1,p}, \ldots, \widetilde{\mathbf{K}}_{k_{\text{max}},1}, \ldots$ … and the sub-matrices $[\widetilde{\mathbf{K}}_{j,1},\dots,\widetilde{\mathbf{K}}_{j,p}]$ and …

Figures (1)

  • Figure 1: A Brief Comparison between Knockoff and the proposed Knoop pipeline

Theorems & Definitions (5)

  • Proposition 1: Exchangeability between multi-layered knockoffs
  • proof
  • Definition 1: Anomaly-based Significance Test
  • Proposition 2: FDR Control of Anomaly-based Significance Test
  • proof