FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables

Danrui Qi, Weiling Zheng, Jiannan Wang

TL;DR

FeatAug is a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables. The authors show how the beam search idea can partially solve the problem of identifying promising attribute combinations for predicates, and they propose several techniques to optimize it further.

Abstract

Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. To augment good features, data scientists need to come up with SQL queries manually, which is time-consuming. Featuretools [1] is a tool widely used by the data science community to automatically augment the training data by extracting new features from relevant tables. It represents each feature as a group-by aggregation SQL query on relevant tables and can automatically generate these SQL queries. However, it does not include predicates in these queries, which significantly limits its application in many real-world scenarios. To overcome this limitation, we propose FeatAug, a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables. This extension is not trivial because considering predicates exponentially increases the number of candidate queries. As a result, the original Featuretools framework, which materializes all candidate queries, will not work and needs to be redesigned. We formally define the problem and model it as a hyperparameter optimization problem. We discuss how Bayesian Optimization can be applied here and propose a novel warm-up strategy to optimize it. To make our algorithm more practical, we also study how to identify promising attribute combinations for predicates. We show how the beam search idea can partially solve the problem and propose several techniques to further optimize it. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features than Featuretools and other baselines. The code is open-sourced at https://github.com/sfu-db/FeatAug
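
To make the distinction between a predicate-free and a predicate-aware feature concrete, the following is a minimal sketch using Python's built-in sqlite3 module. It is not FeatAug's or Featuretools' API. The tables (users, purchases), their columns, the helper augment, and the example predicate are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical toy schema: one row per user in the training table,
# many rows per user in the relevant (one-to-many) table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, label INTEGER);
CREATE TABLE purchases (user_id INTEGER, category TEXT, amount REAL);
INSERT INTO users VALUES (1, 1), (2, 0);
INSERT INTO purchases VALUES
  (1, 'books', 10.0), (1, 'games', 40.0), (1, 'books', 5.0),
  (2, 'games', 20.0), (2, 'games', 30.0);
""")


def augment(predicate=None):
    """Join one group-by aggregation feature onto the training table.

    `predicate` is an optional SQL condition on the relevant table.
    It is interpolated directly into the query, which is acceptable
    only because this is a toy example with trusted input.
    """
    where = f"WHERE {predicate}" if predicate else ""
    sql = f"""
        SELECT u.user_id, u.label, f.feature
        FROM users u
        LEFT JOIN (
            SELECT user_id, AVG(amount) AS feature
            FROM purchases {where}
            GROUP BY user_id
        ) f ON u.user_id = f.user_id
        ORDER BY u.user_id
    """
    return conn.execute(sql).fetchall()


# Predicate-free feature: average amount over all purchases per user.
plain = augment()

# Predicate-aware feature: average amount over book purchases only.
# Users with no matching rows get NULL (None) from the LEFT JOIN.
filtered = augment("category = 'books'")

print(plain)
print(filtered)
```

Each choice of aggregation function, aggregated column, and predicate yields a different candidate feature, which is why adding predicates enlarges the space of candidate queries so quickly.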

Paper Structure (52 sections, 5 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Feature augmentation with predicate-aware SQL queries.
  • Figure 2: Workflow of FeatAug.
  • Figure 3: Workflow of SQL Query Generation Component. Mutual Information (MI) is taken as the low-cost proxy.
  • Figure 4: The illustration for the search space and the process of the Query Template Identification component. ($\beta = 1$)
  • Figure 5: Ablation study of the two optimizations in the Query Template Identification component. (a): the running time of the Query Template Identification component w/o the two optimizations. "X" means that the program cannot complete in 6 hours. (b) - (e): the performance comparison among FeatAug with different Query Template Identification components.
  • ...and 4 more figures

Theorems & Definitions (15)

  • Example 1
  • Example 2
  • Example 3
  • Example 4
  • Definition 1: Query Template
  • Example 5
  • Definition 2: Query Pool
  • Example 6
  • Definition 3: Augmented Training Table
  • Example 7
  • ...and 5 more