Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond

Kazuma Sawaya

TL;DR

This work introduces a feature-selection framework for deep neural networks that approximately controls the false discovery rate (FDR). Using gradient-based input sensitivity and a data-splitting scheme, the authors establish a marginal asymptotic normality result for null features within a broad class of architectures that share a dense first layer, under a $\bm{B}$-right orthogonally invariant ($\bm{B}$-ROI) design. The method yields an FDR-controlling selection rule with minimal distributional assumptions and is supported empirically across multiple network types and data-generating processes. The approach combines modeling flexibility with statistical guarantees for deep feature selection, and the authors outline practical limitations and future extensions.

Abstract

We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons (MLPs) of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting. Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension $n$ diverges faster than the latent dimension $q^{*}$, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume $\mathbf{B}$-right orthogonal invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.
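To make the target quantity concrete: the FDR is the expected proportion of selected features that are actually null. The sketch below computes the empirical false discovery proportion (FDP) and power for a single selection, the two quantities that simulation studies of this kind typically average over runs. The function name `fdp_and_power` and the example index sets are hypothetical illustrations, not taken from the paper.

```python
def fdp_and_power(selected, relevant):
    """Empirical false discovery proportion and power for one selection.

    selected: indices chosen by some selection rule (hypothetical input)
    relevant: indices of the truly relevant (non-null) features
    """
    selected = set(selected)
    relevant = set(relevant)
    false_discoveries = len(selected - relevant)  # selected but null
    true_discoveries = len(selected & relevant)   # selected and relevant
    # Convention: FDP is 0 when nothing is selected.
    fdp = false_discoveries / max(len(selected), 1)
    power = true_discoveries / max(len(relevant), 1)
    return fdp, power


# Illustrative example: 4 features selected, 1 of them null; 3 of 5 relevant found.
fdp, power = fdp_and_power(selected=[0, 1, 2, 7], relevant=[0, 1, 2, 3, 4])
print(fdp, power)
```

Averaging `fdp` over independent runs gives an estimate of the FDR, which a valid procedure should keep near or below the chosen target level.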

Paper Structure

This paper contains 30 sections, 17 theorems, 113 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Under Assumptions asmp:DGP--asmp:init, for each $m,n,q^*\in\mathbb{N}$ and iteration $t\in\mathbb{N}$ of the SGD with/without replacement, conditioning on the learning-rate and mini-batch schedule, is uniformly distributed on the unit sphere lying in $\mathrm{Col}(\bm{B})^\perp$. Here, $\mathrm{Col}(\bm{B})^\perp$ is the orthogonal complement of the column space of $\bm{B}$, and $\bm{P}_{\bm{B}}^

Figures (10)

  • Figure 1: A schematic illustration of $\bm{\xi}^{(t)}$ for $q^* = 1$ and $n = 3$. $\bm{U}\in\mathbb{R}^{n\times n}$ is any orthogonal matrix such that $\bm{U}\bm{B}=\bm{B}$ (i.e., rotation around $\bm{B}$).
  • Figure 2: Histograms of the empirical distribution of ${\sqrt{n}\xi_j^{(10)}}/{\|\bm{P}_{\bm{B}}^\perp\bm{\xi}^{(10)}\|}$ for $j\in S^\mathsf{c}$. The solid black curve shows the $\mathcal{N}(0,1)$ density. The solid red curve represents a normal density fitted to the histograms, and the dotted blue line indicates the empirical mean.
  • Figure 3: Results for the false discovery rate (left) and power (right) when performing feature selection at each iteration using the artificial data and models defined in Section \ref{sec:numer-asyN}. The solid curves represent averages over 20 independent runs, and the shaded areas indicate one standard deviation around the mean.
  • Figure 4: Histograms of the empirical distribution of ${\sqrt{n}\xi_j^{(10)}}/{\|\bm{P}_{\bm{B}}^\perp\bm{\xi}^{(10)}\|}$ for $j\in S^\mathsf{c}$. The solid black curve shows the $\mathcal{N}(0,1)$ density. The solid red curve represents a normal density fitted to the histograms.
  • Figure 5: Results for the FDR/power (left) and the training loss (right) when performing feature selection at each iteration. The solid curves represent averages over 20 independent runs, and the shaded areas indicate one standard deviation around the mean.
  • ...and 5 more figures
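
The normalization in the Figure 2 and Figure 4 captions, $\sqrt{n}\,\xi_j/\|\bm{P}_{\bm{B}}^\perp\bm{\xi}\|$, can be motivated with a small simulation: a vector drawn uniformly from the unit sphere of $\mathrm{Col}(\bm{B})^\perp$ has null coordinates that look approximately standard normal after scaling by $\sqrt{n}$. The sketch below is not the paper's experiment. It assumes a hypothetical $\bm{B}$ with orthonormal columns supported on the first `s` coordinates, and the values of `n`, `s`, and `q` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, q = 2000, 20, 1  # feature dim, number of relevant features, latent dim (illustrative)

# Hypothetical B: orthonormal columns, nonzero only on the first s rows,
# so coordinates s..n-1 play the role of null features.
B = np.zeros((n, q))
B[:s], _ = np.linalg.qr(rng.standard_normal((s, q)))


def project_out(v, B):
    """Project v onto Col(B)^perp, assuming B has orthonormal columns."""
    return v - B @ (B.T @ v)


# Projecting a standard Gaussian onto a subspace and normalizing gives a
# uniform draw on the unit sphere of that subspace.
g = project_out(rng.standard_normal(n), B)
u = g / np.linalg.norm(g)

# Scaled null coordinates, to be compared against the N(0, 1) density.
z_null = np.sqrt(n) * u[s:]
print(z_null.mean(), z_null.std())
```

A histogram of `z_null` plays the same role as the histograms described in the captions: its mean should sit near 0 and its standard deviation near 1 when `n` is large relative to `q`.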

Theorems & Definitions (38)

  • Definition 1: Multi-index model
  • Definition 2
  • Proposition 1
  • Theorem 1
  • Theorem 2
  • Remark 5.1
  • Proof of Proposition \ref{prop:unif}
  • Lemma A.1
  • Proof of Lemma \ref{lem:hanson-wright}
  • Proof of Theorem \ref{thm:asyN}
  • ...and 28 more