Sequential Attention for Feature Selection

Taisuke Yasuda; MohammadHossein Bateni; Lin Chen; Matthew Fahrbach; Gang Fu; Vahab Mirrokni

TL;DR

Feature selection for neural networks under a budget constraint should account for each feature's residual value, that is, its marginal contribution given the features already selected. The authors propose Sequential Attention, a differentiable, one-pass implementation of greedy forward selection that uses learnable attention logits to select $k$ of $d$ features and downscales the not-yet-selected features with a softmax mask, so that selection is optimized end to end with the model. They show that a regularized linear variant of Sequential Attention is equivalent to Sequential LASSO, which in turn is equivalent to Orthogonal Matching Pursuit (OMP) for least-squares regression, so OMP's provable guarantees carry over to the attention-based method. Empirically, Sequential Attention achieves state-of-the-art results on standard neural-network feature selection benchmarks and scales to large datasets such as Criteo; the analysis also connects the effectiveness of attention to overparameterization.

Abstract

Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations towards the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
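The masking step can be illustrated with a minimal sketch, assuming the form given in the Figure 1 caption: selected features pass through unchanged, and each unselected feature $i$ is scaled by a softmax over the attention weights of the unselected set only. The function names (`sequential_attention_mask`, `apply_mask`, `select_next`) are hypothetical and chosen for illustration; training the weights $\mathbf{w}$ jointly with the model is omitted, and picking the largest-weight unselected feature is one plausible reading of "attention weights as a proxy for feature importance", not a statement of the authors' implementation.

```python
import math


def sequential_attention_mask(w, selected):
    """Per-feature scale factors: 1.0 for selected features, and a softmax
    over the attention weights of the unselected features otherwise.

    `w` is a list of d attention logits; `selected` is a set of indices.
    (Illustrative helper, not the authors' code.)
    """
    d = len(w)
    unselected = [i for i in range(d) if i not in selected]
    if not unselected:
        return [1.0] * d
    # Subtract the max logit for numerical stability; the softmax is unchanged.
    m = max(w[i] for i in unselected)
    exps = {i: math.exp(w[i] - m) for i in unselected}
    total = sum(exps.values())
    return [1.0 if i in selected else exps[i] / total for i in range(d)]


def apply_mask(x, mask):
    """Scale one input example elementwise before it is fed to the model."""
    return [xi * mi for xi, mi in zip(x, mask)]


def select_next(w, selected):
    """Assumed selection rule: pick the unselected feature with the largest
    attention weight. Requires at least one unselected feature."""
    candidates = [i for i in range(len(w)) if i not in selected]
    return max(candidates, key=lambda i: w[i])


# Usage: feature 3 is already selected; the other three share a softmax.
w = [0.0, 0.0, 1.0, 5.0]
selected = {3}
mask = sequential_attention_mask(w, selected)
masked_x = apply_mask([2.0, 2.0, 2.0, 2.0], mask)
next_feature = select_next(w, selected)
```

In this sketch the weight of an already selected feature no longer affects the mask, so the remaining weights compete only among the unselected features, which is what lets them reflect marginal rather than standalone importance.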

Paper Structure (35 sections, 5 theorems, 26 equations, 11 figures, 10 tables, 3 algorithms)

Introduction
Sequential Attention.
Theoretical guarantees.
Towards understanding attention.
Connections to overparameterization.
Related work
Preliminaries
Notation.
Feature selection algorithms for linear regression.
Equivalence for least squares: OMP and Sequential Attention
Regularized linear Sequential Attention and Sequential LASSO
Sequential LASSO and OMP
Geometry of Sequential LASSO.
Selection of features in Sequential LASSO.
Experiments
...and 20 more sections

Key Result

Theorem 1.1

For linear regression, regularized linear Sequential Attention is equivalent to OMP.
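For reference, the classical OMP procedure that the theorem refers to can be sketched as follows. This is a minimal standard-library illustration of textbook OMP for least squares, not the paper's code: the function name `omp`, the column-list data layout, and the normalization of correlation scores by column norm are assumptions made for the example. It assumes every column is nonzero and that $k \le d$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def omp(columns, y, k):
    """Orthogonal Matching Pursuit sketch.

    `columns` is a list of d feature columns (each a list of n floats),
    `y` is the length-n target. Returns the selected indices in selection
    order and the final least-squares residual.
    Assumes nonzero columns and k <= d.
    """
    selected = []
    basis = []            # orthonormal basis for the span of selected columns
    residual = list(y)    # y minus its projection onto that span
    for _ in range(k):
        candidates = [j for j in range(len(columns)) if j not in selected]
        # Greedy step: pick the column most correlated with the residual.
        best = max(
            candidates,
            key=lambda j: abs(dot(columns[j], residual))
            / math.sqrt(dot(columns[j], columns[j])),
        )
        selected.append(best)
        # Gram-Schmidt: orthogonalize the new column against the basis.
        q = list(columns[best])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm > 1e-12:  # skip columns already in the span
            q = [qi / norm for qi in q]
            basis.append(q)
            # Re-project: remove the new direction from the residual.
            c = dot(residual, q)
            residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return selected, residual


# Usage: y depends on columns 0 and 2 only.
columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.0, 1.0]
selected, residual = omp(columns, y, k=2)
```

Each iteration selects against the residual left by the features already chosen, which is the same "marginal contribution" idea the abstract describes for Sequential Attention.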

Figures (11)

Figure 1: Sequential Attention applied to model $f(\cdot;\boldsymbol{\theta})$. At each step, the selected features $i \in S$ are used as direct inputs to the model and the unselected features $i \not\in S$ are downscaled by the scalar value $\mathrm{softmax}_i(\mathbf{w},\overline{S})$, where $\mathbf{w}\in\mathbb{R}^d$ is the vector of learned attention weights and $\overline{S} = [d]\setminus S$.
Figure 2: Contour plot of $Q^*(\boldsymbol{\beta} \circ \boldsymbol{\beta})$ for $\boldsymbol{\beta} \in \mathbb{R}^2$ at different zoom-levels of $|\boldsymbol{\beta}_{i}|$.
Figure 3: Feature selection results for small-scale neural network experiments. Here, SA = Sequential Attention, LLY = the method of LLY2021, GL = Group LASSO, SL = Sequential LASSO, OMP = Orthogonal Matching Pursuit, and CAE = Concrete Autoencoder (ABZ2019).
Figure 4: AUC and log loss when selecting $k \in \{10, 15, 20, 25, 30, 35\}$ features for Criteo dataset.
Figure 5: Visualizations of the $k=50$ pixels selected by the feature selection algorithms on MNIST.
...and 6 more figures

Theorems & Definitions (10)

Theorem 1.1
Theorem 1.2
Definition 3.1: Regularized linear Sequential Attention
Lemma 3.2
Theorem 3.3
Remark 1
Lemma 3.4: Projection residuals of the Sequential LASSO
Proof of Lemma 3.4 (label: lem:proj-res)
Proof of the lemma labeled lem:Q-star
Definition B.1: Marginal gains
