FeatNavigator: Automatic Feature Augmentation on Tabular Data

Jiaming Liang, Chuan Lei, Xiao Qin, Jiani Zhang, Asterios Katsifodimos, Christos Faloutsos, Huzefa Rangwala

TL;DR

FeatNavigator addresses automatic feature augmentation on relational tabular data by decomposing the utility gain into feature importance (FI) and integration quality (IQ). It trains a lightweight FI estimator via clustering and an LSTM-based IQ model that predicts the benefit of join paths, then uses a breadth-first-search (BFS) style search with pruning to efficiently identify high-value feature-path augmentations under a budget. Empirical results on five public datasets show improvements of up to 40.1% in ML performance over state-of-the-art baselines, along with favorable end-to-end latency. The framework enables scalable, model-aware augmentation in open-data repositories and adapts to new tables with minimal retraining, making it practical for data-centric ML pipelines.
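
The search step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that a candidate's score is the product of its FI and the IQ of its join path, that paths below a fixed IQ threshold are pruned, and that the budget is a cap on the number of selected features. None of these details are given in this summary. The function `search_augmentations`, the table and feature names other than `product`, and all numeric values are hypothetical, and a simple per-edge product stands in for the LSTM-based IQ model.

```python
from collections import deque


def search_augmentations(join_graph, base, fi, iq, budget, max_hops=3, min_iq=0.2):
    """BFS over join paths that start at the base table.

    join_graph: dict mapping a table to the tables it can be joined with
    fi:         dict mapping (table, feature) to an estimated feature importance
    iq:         callable taking a path (tuple of tables) and returning a value in [0, 1]
    Returns up to `budget` (score, (table, feature), path) triples, best first.
    """
    best = {}  # (table, feature) -> (score, path)
    queue = deque([(base,)])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # do not extend paths beyond the hop limit
        for nxt in join_graph.get(path[-1], []):
            if nxt in path:
                continue  # avoid cycles
            new_path = path + (nxt,)
            quality = iq(new_path)
            if quality < min_iq:
                continue  # prune: low-quality paths are neither scored nor extended
            for (table, feat), importance in fi.items():
                if table == nxt:
                    score = importance * quality  # assumed combination rule
                    if score > best.get((table, feat), (0.0, None))[0]:
                        best[(table, feat)] = (score, new_path)
            queue.append(new_path)
    ranked = sorted(((s, f, p) for f, (s, p) in best.items()), reverse=True)
    return ranked[:budget]


# Hypothetical join graph rooted at the base table `product`.
join_graph = {"product": ["review", "seller"], "review": ["user"], "seller": ["user"]}
edge_quality = {
    ("product", "review"): 0.9,
    ("product", "seller"): 0.5,
    ("review", "user"): 0.8,
    ("seller", "user"): 0.3,
}


def iq(path):
    """Stand-in IQ estimate: product of hypothetical per-edge qualities."""
    quality = 1.0
    for a, b in zip(path, path[1:]):
        quality *= edge_quality[(a, b)]
    return quality


fi = {("review", "rating"): 0.6, ("seller", "region"): 0.2, ("user", "age"): 0.5}
plan = search_augmentations(join_graph, "product", fi, iq, budget=2)
for score, feature, path in plan:
    print(feature, " -> ".join(path), round(score, 3))
```

In this toy setup the path `product -> seller -> user` falls below the pruning threshold, so the 2-hop feature is reached only through `review`. This shows why the choice of join path matters as much as the feature itself.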

Abstract

Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, which aims to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them cannot utilize all useful features, because many of these features reside in candidate tables that are not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.

Paper Structure (22 sections, 5 theorems, 17 equations, 10 figures, 4 tables, 3 algorithms)

Key Result

lemma 1

$US(T_{aug})$ is the weighted average of $US(\textit{non-NA})$ and $US(\textit{NA})$, namely, $US(T_{aug}) = p \times US(\textit{non-NA}) + (1-p) \times US(\textit{NA})$, where, for simplicity, $p$ denotes the integration quality measurement.
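
As a worked illustration of the lemma, the sketch below left-joins one feature onto a toy base table and evaluates the weighted average. It assumes that $p$ can be read as the fraction of base rows that receive a non-missing value after the join. The lemma's non-NA/NA split suggests this reading, but this summary does not confirm it. The helper names and the two $US$ values are hypothetical.

```python
def left_join_feature(base_keys, candidate):
    """Left-join one feature onto the base table; unmatched keys become None (NA)."""
    return [candidate.get(key) for key in base_keys]


def match_ratio(joined):
    """Fraction of base rows with a non-NA value (assumed stand-in for p)."""
    return sum(value is not None for value in joined) / len(joined)


def combined_us(p, us_non_na, us_na):
    """Lemma 1: US(T_aug) = p * US(non-NA) + (1 - p) * US(NA)."""
    return p * us_non_na + (1 - p) * us_na


base_keys = [1, 2, 3, 4]
candidate = {1: 0.7, 2: 0.1, 4: 0.9}  # key 3 has no match in the candidate table
joined = left_join_feature(base_keys, candidate)
p = match_ratio(joined)  # 3 of the 4 base rows are matched
us_aug = combined_us(p, us_non_na=0.8, us_na=0.6)  # hypothetical US values
print(joined, p, us_aug)
```

With three of four rows matched, the combined value lies three quarters of the way from $US(\textit{NA})$ to $US(\textit{non-NA})$. A join path that matches fewer rows pulls the result toward $US(\textit{NA})$.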

Figures (10)

  • Figure 1: Example of feature augmentation. The schema of the base table, product, is in blue. The task is to predict the values in the boolean type column 'recommend'. The schemata in green and yellow describe the tables that are 1-hop and 2-hop joinable with the base table.
  • Figure 2: FeatNavigator overview.
  • Figure 3: A join graph with examples of augmented tables.
  • Figure 4: Utility gain, feature importance, and integration quality.
  • Figure 5: LSTM network for $IQ$ estimation.
  • ...and 5 more figures

Theorems & Definitions (14)

  • definition 1: Base Table
  • definition 2: Candidate Tables
  • definition 3: Join Path
  • definition 4: Augmented Table
  • definition 5: Machine Learning Task
  • definition 6: Utility Gain
  • definition 7: Feature Importance
  • definition 8: Integration Quality
  • lemma 1
  • lemma 2
  • ...and 4 more