Learned Feature Importance Scores for Automated Feature Engineering

Yihe Dong; Sercan Arik; Nathanael Yoder; Tomas Pfister

TL;DR

AutoMAN addresses automated feature engineering by learning input-feature masks over a small, expert-curated transform space, optimizing for the downstream task end-to-end. Rather than explicitly enumerating all transform-feature combinations, it uses local and global importance masks to discover useful engineered features implicitly, and it extends to time series with learnable temporal masks. The method achieves state-of-the-art downstream performance with significantly lower latency than baselines, demonstrating scalability to large feature spaces and applicability across tabular and time-series data. The approach is motivated theoretically by the observation that a small basis of transforms can approximate broader function spaces, and it offers practical benefits for model deployment and maintenance.

Abstract

Feature engineering has demonstrated substantial utility for many machine learning workflows, such as in the small-data regime or when distribution shifts are severe. Automating this capability can therefore relieve much manual effort and improve model performance. To this end, we propose AutoMAN (Automated Mask-based Feature Engineering), an automated feature engineering framework that achieves high accuracy and low latency and can be extended to heterogeneous and time-varying data. AutoMAN is based on effectively exploring the space of candidate transforms without explicitly materializing the transformed features. This is achieved by learning feature importance masks, which can be extended to support other modalities such as time series. AutoMAN learns feature transform importance end-to-end, incorporating a dataset's task target directly into feature engineering, resulting in state-of-the-art performance with significantly lower latency than alternatives.

Paper Structure (22 sections, 4 theorems, 1 equation, 4 figures, 5 tables, 2 algorithms)

Introduction
Methods
Efficient search in feature transform space
Learning feature importance masks
Extending feature discovery to time series
Transform functions explored
Complexity analysis
Motivation for limiting transform functions
Experiments and Results
Experiments
Datasets
Results
Performance improvement with engineered features.
Qualitative analysis on engineered features.
Related work
...and 7 more sections

Key Result

Lemma 3.1

For any $N > 0$, any continuous function $g: [-N, N]^n \to \mathbb{R}$, and any $\epsilon > 0$, there exist a finite sequence of reals $c_i$ and a finite sequence of Gaussian functions $f_i$ such that $|g(x) - \sum_i c_i f_i(x)| < \epsilon$ for all $x \in [-N, N]^n$.
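
A small numerical illustration may help make the statement of Lemma 3.1 concrete. The sketch below handles only the one-dimensional case ($n = 1$) and builds the Gaussian sum by kernel smoothing on a regular grid. That construction, the example function `g`, and the parameters `sigma` and `step` are all assumptions chosen for illustration; this is not the paper's proof or its implementation.

```python
import math

def gaussian(x, center, sigma):
    """Unnormalized Gaussian bump f_i(x) = exp(-(x - center)^2 / (2 sigma^2))."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def fit_gaussian_sum(g, N, sigma=0.1, step=0.05):
    """Return (centers, coeffs) so that sum_i c_i f_i approximates g on [-N, N].

    Assumed construction: a Riemann sum of g convolved with a narrow Gaussian.
    Centers extend slightly past [-N, N]; g is clamped there so that it is
    only ever evaluated inside its domain.
    """
    pad = 5.0 * sigma
    count = int(round((2.0 * (N + pad)) / step)) + 1
    centers = [-(N + pad) + i * step for i in range(count)]
    norm = step / (sigma * math.sqrt(2.0 * math.pi))
    coeffs = [norm * g(min(max(c, -N), N)) for c in centers]
    return centers, coeffs

def evaluate(x, centers, coeffs, sigma=0.1):
    """Evaluate the finite Gaussian sum at x."""
    return sum(c * gaussian(x, mu, sigma) for mu, c in zip(centers, coeffs))

def g(x):
    """Hypothetical continuous target function on [-N, N]."""
    return math.sin(x) + 0.5 * x

N = 2.0
centers, coeffs = fit_gaussian_sum(g, N)

# Worst-case gap between g and the Gaussian sum on a fine grid over [-N, N].
grid = [-N + 0.01 * i for i in range(401)]
max_error = max(abs(g(x) - evaluate(x, centers, coeffs)) for x in grid)
```

With these settings the worst-case gap should fall under $\epsilon = 0.05$. Shrinking `sigma` and `step` together tightens the approximation at the cost of more Gaussian terms, which matches the lemma's "for any $\epsilon$" quantifier.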

Figures (4)

Figure 1: Overview of the proposed approach, AutoMAN. Local importance masks are learned to select the most relevant features for a given transform, and global importance masks are learned to select the most relevant transformed features across all transforms, all end-to-end using the downstream task objective. The input features are fed into all pertinent transform candidates, depending on whether a feature is categorical, numerical, or temporal. The model prediction head consists of an MLP network. The transform functions $\{f_i\}_{i=1}^k$ are applied to their respective selected features.
Figure 2: Learnable selection of features for a given transform $f$ via local feature importance masking, where $\odot$ denotes element-wise multiplication. Here, three features are selected and weighted by the mask and input into $f$.
Figure 3: Using a global mask to select the global features from the set of transformed feature candidates, optimized for the downstream task performance. Note that since the number of top selected elements for each local and global mask is fixed a priori, the input to the predictor does not vary in size during training.
Figure 4: Performance comparisons between raw and AutoMAN-engineered features. Higher is better. MLP (indicated by the "M" suffix) and XGBoost (indicated by the "X" suffix) predictors are trained on either the raw or the engineered features. Interestingly, the performance gain is larger for smaller datasets with fewer samples and features (Mice and Isolet) than for larger datasets (MNIST and Fraud).
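
The masking steps described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal forward-pass illustration under stated assumptions, not the authors' implementation: the helper names (`top_k_mask`, `apply_local_mask`, `apply_global_mask`), the softmax weighting, the three toy transforms, and the hand-fixed scores are all hypothetical. In AutoMAN the mask scores are learned end-to-end from the downstream objective, and how gradients pass through the selection is not covered here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mask(scores, k):
    """Softmax weights for the k highest-scoring entries, zeros elsewhere."""
    weights = softmax(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

def apply_local_mask(x, local_scores, k, transform):
    """Select and weight k input features (mask ⊙ x), then apply one transform."""
    mask = top_k_mask(local_scores, k)
    weighted = [m * v for m, v in zip(mask, x)]
    selected = [w for w, m in zip(weighted, mask) if m != 0.0]
    return transform(selected)

def apply_global_mask(candidates, global_scores, k):
    """Keep the k highest-scoring transformed features; output length is always k."""
    mask = top_k_mask(global_scores, k)
    return [m * c for m, c in zip(mask, candidates) if m != 0.0]

# Hypothetical inputs: four raw features, three toy transforms, fixed scores.
x = [2.0, -1.0, 0.5, 3.0]
transforms = [sum, max, min]
local_scores = [
    [1.0, 0.2, -0.5, 0.8],   # one score vector per transform
    [0.1, 0.9, 0.3, -1.0],
    [-0.2, 0.4, 1.5, 0.0],
]

# Local step: each transform sees only its own three selected features.
candidates = [apply_local_mask(x, s, 3, f) for s, f in zip(local_scores, transforms)]

# Global step: two of the three transformed candidates reach the predictor.
engineered = apply_global_mask(candidates, [0.5, -0.3, 1.2], 2)
```

Because `k` is fixed for both the local and the global masks, `engineered` always has the same length, which is the property the Figure 3 caption points out: the predictor's input size does not change during training.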

Theorems & Definitions (6)

Lemma 3.1
proof
Lemma 3.2
Theorem A.1: Generalized Stone–Weierstrass theorem (Stone_Ross_2007)
Lemma A.2
proof
