EntryPrune: Neural Network Feature Selection using First Impressions

Felix Zimmer, Patrik Okanovic, Torsten Hoefler

TL;DR

EntryPrune tackles supervised feature selection in neural networks by enforcing a dynamically sparse input layer and ranking candidate features via entry-based pruning. It couples two processes (continuous weight optimization and discrete, history-dependent mask updates) with random regrowth, and adds a flexible variant (EntryPrune flex) that adapts the input size during training, guided by the hyperparameters $K$ and $c_{\text{ratio}}$. Empirically, EntryPrune and EntryPrune flex outperform state-of-the-art baselines on long datasets and remain competitive on wide datasets, with lower runtimes than several competing methods. The approach works with dense backbones but is not compatible with weight-sharing first layers (e.g., CNNs, Vision Transformers), which limits its applicability to those architectures. The code is publicly available.

Abstract

There is an ongoing effort to develop feature selection algorithms to improve interpretability, reduce computational resources, and minimize overfitting in predictive models. Neural networks stand out as architectures on which to build feature selection methods, and recently, neuron pruning and regrowth have emerged from the sparse neural network literature as promising new tools. We introduce EntryPrune, a novel supervised feature selection algorithm using a dense neural network with a dynamic sparse input layer. It employs entry-based pruning, a novel approach that compares neurons by the relative change they induce when they enter the network. Extensive experiments on 13 different datasets show that our approach generally outperforms the current state-of-the-art methods, and in particular improves the average accuracy on low-dimensional datasets. Furthermore, we show that entry-based pruning surpasses traditional techniques such as magnitude pruning within the EntryPrune framework and that EntryPrune achieves lower runtime than competing approaches. Our code is available at https://github.com/flxzimmer/entryprune.

Paper Structure (36 sections, 1 equation, 12 figures, 8 tables, 2 algorithms)

Figures (12)

  • Figure 1: Visualization of feature selection results on MNIST using 25 out of 784 features. Each panel shows the binary mask (top left) of the selected pixels, along with sample digits where only the selected pixels are visible. The downstream accuracy scores (from an SVM classifier) were computed by training on only these selected features and are averaged across five runs (see Appendix \ref{sec:a:setup} for details). Our approach enhances interpretability by identifying a minimal set of critical features that maintain classification performance. Appendix \ref{sec:a:figcifar} includes a similar visualization for CIFAR-10.
  • Figure 2: Entry score calculation in EntryPrune (Algorithm \ref{algo:main}). The network's input layer includes the $K$ selected features plus extra candidates. Over several mini-batches, whose number is set by the hyperparameter $n_{\text{mb}}$, the first-layer gradients ${\bm{\mathsfit{G}}}_k^{(1)}$ are accumulated in a matrix ${\bm{S}}$. We then compute the $L^1$ norm for each input neuron and standardize the resulting vector to get the relative change scores ${\bm{s}}$. The scores of the candidate features are entered into the entry score vector ${\bm{e}}$. Features with the top $K$ entry scores stay; the others are randomly regrown. Candidate feature weights are reinitialized before training continues.
  • Figure 3: EntryPrune runtime metrics. Left: Number of changes to the top $K$ features over time. Right: Minimum entry score (blue) and minimum absolute first layer weight magnitude (red) among top $K$ features. Changes become less frequent as training progresses and the minimum entry score increases. Minimum weight magnitudes increase during stable phases (e.g., updates 47–51).
  • Figure 4: Resulting accuracy for the studied methods by dataset and number of selected features $K$ using the SVM downstream learner. "All Features" is the accuracy using all features in the dataset. Each point shows the mean accuracy across five runs, with error bars indicating the standard deviation. The percentage shown in parentheses after each dataset name indicates the proportion of features that $K=100$ corresponds to, relative to the full feature set. Datasets marked with an asterisk were evaluated with a limited set of baseline methods (see Appendix \ref{sec:a:setup}), while baseline results for the other datasets are reproduced from \citet{atashgahi2023supervised}. Results for all downstream learners are shown in Appendix \ref{sec:a:resulttables}.
  • Figure 5: Average accuracy (across all values of $K$) by dataset for the studied methods using the SVM downstream learner. Our proposed methods are "EntryPrune" and "EntryPrune flex". Datasets marked with an asterisk were evaluated with a limited set of baseline methods (see Appendix \ref{sec:a:setup}), while baseline results for the other datasets are reproduced from \citet{atashgahi2023supervised}. Results for all downstream learners are shown in Appendix \ref{sec:a:resulttables}.
  • ...and 7 more figures
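
The entry score update described in the Figure 2 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function names `relative_change_scores` and `update_selection`, the `(n_hidden, n_inputs)` gradient layout, the z-score standardization, and the choice to sample regrown candidates from all features outside the kept set are assumptions made here for illustration. Weight reinitialization of the new candidates is omitted.

```python
import random
import statistics


def relative_change_scores(grad_batches):
    """Return one standardized score per input neuron.

    grad_batches: list of first-layer gradient matrices, each a list of rows
    with shape (n_hidden, n_inputs). This layout is an assumption.
    """
    n_hidden = len(grad_batches[0])
    n_inputs = len(grad_batches[0][0])
    # Accumulate the gradients of all mini-batches in the matrix S.
    S = [[0.0] * n_inputs for _ in range(n_hidden)]
    for G in grad_batches:
        for i in range(n_hidden):
            for j in range(n_inputs):
                S[i][j] += G[i][j]
    # L1 norm per input neuron: sum of |S[i][j]| over the hidden units i.
    l1 = [sum(abs(S[i][j]) for i in range(n_hidden)) for j in range(n_inputs)]
    # Standardize the vector (z-score; the exact form is an assumption).
    mean = statistics.fmean(l1)
    std = statistics.pstdev(l1)
    if std == 0.0:
        return [0.0] * n_inputs
    return [(v - mean) / std for v in l1]


def update_selection(active, candidates, s, entry, K, n_features, rng):
    """Record entry scores, keep the top K features, and regrow the rest.

    active: feature ids currently in the input layer, in the order of s.
    candidates: ids among `active` that entered since the last update.
    entry: dict mapping feature id to its stored entry score (updated in place).
    """
    # Only candidates receive a new entry score; kept features retain theirs.
    for pos, feat in enumerate(active):
        if feat in candidates:
            entry[feat] = s[pos]
    ranked = sorted(active, key=lambda f: entry[f], reverse=True)
    kept = ranked[:K]
    # Random regrowth: draw replacements from features outside the kept set
    # (whether just-pruned features are eligible is an assumption here).
    pool = [f for f in range(n_features) if f not in kept]
    new_candidates = rng.sample(pool, len(active) - K)
    return kept, new_candidates


# Toy example: two mini-batches, two hidden units, four active input neurons.
grads = [
    [[1.0, 0.0, -2.0, 0.5], [1.0, 0.0, 2.0, 0.5]],
    [[1.0, 0.0, -2.0, 0.5], [1.0, 0.0, 2.0, 0.5]],
]
s = relative_change_scores(grads)
entry = {10: 0.5, 11: -0.2}  # scores recorded when features 10 and 11 entered
kept, fresh = update_selection(
    [10, 11, 12, 13], {12, 13}, s, entry, K=2, n_features=20, rng=random.Random(0)
)
```

In this toy run, candidate 12 enters with the highest score and displaces feature 11, while candidate 13 is dropped. Because the stored scores of features 10 and 11 are not recomputed, the ranking depends on each feature's score at the time it entered, which is the "first impression" idea named in the title.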