Sequential Attention for Feature Selection

Taisuke Yasuda, MohammadHossein Bateni, Lin Chen, Matthew Fahrbach, Gang Fu, Vahab Mirrokni

TL;DR

Feature selection under budget constraints for neural networks must account for residual feature contributions. The authors propose Sequential Attention, a differentiable, one-pass greedy forward method that uses learnable attention logits to select $k$ features from $d$ and downscale others via a softmax mask, enabling end-to-end optimization. They establish that a regularized linear variant of Sequential Attention is equivalent to Sequential LASSO, which in turn is equivalent to Orthogonal Matching Pursuit (OMP) for least-squares regression, transferring OMP's guarantees to the attention framework. Empirically, Sequential Attention achieves state-of-the-art results on standard neural-network benchmarks and scales to large datasets like Criteo, while maintaining efficiency and revealing the role of overparameterization in attention.

Abstract

Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations towards the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
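As a rough illustration of the greedy loop described in the abstract, the sketch below runs a toy version on a linear model with hand-written gradients in NumPy. It is not the authors' implementation: the function name `sequential_attention_linear`, the hyperparameters `steps_per_phase` and `lr`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def sequential_attention_linear(X, y, k, steps_per_phase=300, lr=0.05):
    """Toy greedy forward selection driven by softmax attention logits."""
    n, d = X.shape
    theta = np.zeros(d)  # linear model weights (hypothetical stand-in for f)
    w = np.zeros(d)      # attention logits, one per feature
    selected = []
    for _ in range(k):
        unselected = [i for i in range(d) if i not in selected]
        for _ in range(steps_per_phase):
            # Mask: 1 for selected features, softmax over unselected logits.
            z = w[unselected] - w[unselected].max()
            p = np.exp(z) / np.exp(z).sum()
            mask = np.ones(d)
            mask[unselected] = p
            # Gradient of the loss 0.5/n * ||X (mask * theta) - y||^2.
            residual = X @ (mask * theta) - y
            g = X.T @ residual / n
            grad_theta = mask * g
            grad_mask = (theta * g)[unselected]
            # Backpropagate through the softmax to the unselected logits.
            grad_w = p * (grad_mask - p @ grad_mask)
            theta -= lr * grad_theta
            w[unselected] -= lr * grad_w
        # Add the unselected feature with the largest attention logit.
        selected.append(unselected[int(np.argmax(w[unselected]))])
    return selected

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 6))
y_demo = 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4]
chosen = sequential_attention_linear(X_demo, y_demo, k=2)
```

The model weights are kept across phases rather than retrained from scratch, which is one way to mimic the one-pass flavor of the method. On real data the step count and learning rate would need tuning.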

Paper Structure

This paper contains 35 sections, 5 theorems, 26 equations, 11 figures, 10 tables, and 3 algorithms.

Key Result

Theorem 1.1

For linear regression, regularized linear Sequential Attention is equivalent to OMP.
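For reference, classical OMP for least squares can be sketched in a few lines of NumPy. This is the textbook algorithm under the assumption of unit-norm columns, not code from the paper; the function name `omp` and the synthetic data are illustrative.

```python
import numpy as np

def omp(X, y, k):
    """Greedy OMP: pick k columns of X to fit y by least squares."""
    selected = []
    residual = y.copy()
    coef = np.zeros(0)
    for _ in range(k):
        # Score each column by its absolute correlation with the residual.
        scores = np.abs(X.T @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick a column twice
        selected.append(int(np.argmax(scores)))
        # Refit on all selected columns, then recompute the residual.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected, coef

# Synthetic, noise-free data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 8))
X_demo /= np.linalg.norm(X_demo, axis=0)  # normalize columns to unit norm
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 5]
selected, coef = omp(X_demo, y_demo, k=2)
```

Each round refits on the whole selected set, so the residual stays orthogonal to every chosen column. That refit is what distinguishes OMP from plain matching pursuit.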

Figures (11)

  • Figure 1: Sequential Attention applied to model $f(\cdot;\boldsymbol{\theta})$. At each step, the selected features $i \in S$ are used as direct inputs to the model and the unselected features $i \not\in S$ are downscaled by the scalar value $\mathrm{softmax}_i(\mathbf{w},\overline S)$, where $\mathbf{w}\in\mathbb R^d$ is the vector of learned attention weights and $\overline{S} = [d]\setminus S$.
  • Figure 2: Contour plot of $Q^*(\boldsymbol{\beta} \circ \boldsymbol{\beta})$ for $\boldsymbol{\beta} \in \mathbb{R}^2$ at different zoom-levels of $|\boldsymbol{\beta}_{i}|$.
  • Figure 3: Feature selection results for small-scale neural network experiments. Here, SA = Sequential Attention, LLY = the method of [LLY2021], GL = Group LASSO, SL = Sequential LASSO, OMP = Orthogonal Matching Pursuit, and CAE = Concrete Autoencoder [ABZ2019].
  • Figure 4: AUC and log loss when selecting $k \in \{10, 15, 20, 25, 30, 35\}$ features for Criteo dataset.
  • Figure 5: Visualizations of the $k=50$ pixels selected by the feature selection algorithms on MNIST.
  • ...and 6 more figures
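
The masking step in the Figure 1 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `attention_mask` and the example logits are hypothetical.

```python
import math

def attention_mask(w, selected):
    """Per-feature multipliers: 1.0 for selected features,
    softmax over the unselected logits for all others."""
    d = len(w)
    unselected = [i for i in range(d) if i not in selected]
    mask = [1.0] * d
    if not unselected:
        return mask
    # Subtract the largest logit for numerical stability.
    top = max(w[i] for i in unselected)
    exps = {i: math.exp(w[i] - top) for i in unselected}
    total = sum(exps.values())
    for i in unselected:
        mask[i] = exps[i] / total
    return mask

# Example logits (hypothetical): feature 1 is already selected.
w = [0.0, 2.0, 0.0, 1.0]
mask = attention_mask(w, selected={1})
```

The softmax is taken only over $\overline{S}$, so the multipliers of the unselected features sum to one while every selected feature passes through unscaled.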

Theorems & Definitions (10)

  • Theorem 1.1
  • Theorem 1.2
  • Definition 3.1: Regularized linear Sequential Attention
  • Lemma 3.2
  • Theorem 3.3
  • Remark 1
  • Lemma 3.4: Projection residuals of the Sequential LASSO
  • Proof of Lemma (lem:proj-res)
  • Proof of Lemma (lem:Q-star)
  • Definition B.1: Marginal gains