Sequential Attention for Feature Selection
Taisuke Yasuda, MohammadHossein Bateni, Lin Chen, Matthew Fahrbach, Gang Fu, Vahab Mirrokni
TL;DR
Feature selection under budget constraints for neural networks must account for residual feature contributions, i.e., the marginal value of a feature given those already selected. The authors propose Sequential Attention, a differentiable, one-pass implementation of greedy forward selection that uses learnable attention logits to select $k$ of $d$ features and downscales the remaining features via a softmax mask, enabling end-to-end optimization. They establish that a regularized linear variant of Sequential Attention is equivalent to Sequential LASSO, which in turn is equivalent to Orthogonal Matching Pursuit (OMP) for least-squares regression, transferring OMP's guarantees to the attention framework. Empirically, Sequential Attention achieves state-of-the-art results on standard neural-network benchmarks and scales to large datasets like Criteo while remaining efficient. The analysis also offers an explanation for the effectiveness of attention and its connection to overparameterization.
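The masking step described above can be illustrated with a minimal NumPy sketch. It assumes that already-selected features keep weight 1 while unselected features are scaled by a softmax over their logits, and that the next feature chosen is the unselected one with the largest logit. The function names `softmax_mask` and `select_next` are hypothetical, and training the logits jointly with the model is omitted.

```python
import numpy as np

def softmax_mask(logits, selected):
    """Return a length-d mask: 1.0 on selected features and softmax
    weights over the unselected ones (assumed parameterization)."""
    logits = np.asarray(logits, dtype=float)
    chosen = np.zeros(logits.shape[0], dtype=bool)
    chosen[np.array(list(selected), dtype=int)] = True
    mask = np.zeros_like(logits)
    mask[chosen] = 1.0
    rest = ~chosen
    if rest.any():
        # Subtract the max before exponentiating for numerical stability.
        shifted = logits[rest] - logits[rest].max()
        weights = np.exp(shifted)
        mask[rest] = weights / weights.sum()
    return mask

def select_next(logits, selected):
    """Return the index of the unselected feature with the largest logit."""
    logits = np.asarray(logits, dtype=float)
    taken = set(selected)
    candidates = [i for i in range(logits.shape[0]) if i not in taken]
    return max(candidates, key=lambda i: logits[i])
```

In a full pipeline the mask would multiply the input features elementwise, the logits would be trained together with the network, and `select_next` would be called once per selection phase until $k$ features are chosen.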
Abstract
Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
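For reference, the classical OMP procedure mentioned above can be sketched in a few lines of NumPy. This is the textbook algorithm for least-squares regression, not the authors' Sequential Attention code; the function name `omp` is illustrative, and the sketch assumes the columns of `X` are on a comparable scale (for example, unit norm).

```python
import numpy as np

def omp(X, y, k):
    """Greedy Orthogonal Matching Pursuit: select k columns of X for y ~ X w."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []
    coef = np.zeros(0)
    residual = y.copy()
    for _ in range(k):
        # Score each feature by its absolute correlation with the residual.
        scores = np.abs(X.T @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same feature twice
        selected.append(int(np.argmax(scores)))
        # Refit least squares on all selected features, then update the residual.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected, coef
```

Each round scores features against the current residual, which is how the procedure captures the marginal contribution of a feature given those already selected.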
