IIFE: Interaction Information Based Automated Feature Engineering

Tom Overman; Diego Klabjan; Jean Utke

TL;DR

The paper addresses automated feature engineering (AutoFE) by introducing IIFE, an iterative method that uses interaction information to identify synergistic feature pairs and construct high-quality features from them. It formalizes the interaction information of features $F_i$, $F_j$ and target $Y$ as $\tau_{ij}=I(F_i,F_j,Y)=I(F_i,F_j|Y)-I(F_i,F_j)$ and uses this score to choose which feature pairs to combine through uni- and bivariate transformations, with candidate features validated through cross-validated model evaluation. Empirical results show that IIFE outperforms recent AutoFE baselines on multiple public datasets and on a large-scale private dataset. Interaction information can also accelerate other expand-reduce AutoFE approaches, and IIFE can be combined with other methods for further gains. The work also critiques prevalent AutoFE evaluation practices: it demonstrates that cross-validation gains are inflated and proposes inductive hold-out evaluation as a more realistic metric, with practical implications for reproducibility and fair comparisons.
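As a rough illustration of the score $\tau_{ij}$ defined above, the following sketch estimates it for discrete columns with a simple plug-in (empirical frequency) entropy estimator. The estimator choice and the function names `entropy` and `interaction_information` are assumptions made for this example; the summary does not say which estimator the paper uses. It relies on the identities $I(F_i,F_j)=H(F_i)+H(F_j)-H(F_i,F_j)$ and $I(F_i,F_j|Y)=H(F_i,Y)+H(F_j,Y)-H(F_i,F_j,Y)-H(Y)$.

```python
import numpy as np

def entropy(*columns):
    """Plug-in Shannon entropy (in nats) of the joint distribution of discrete columns."""
    joint = np.stack(columns, axis=1)                 # shape (n_samples, n_columns)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def interaction_information(fi, fj, y):
    """tau_ij = I(F_i, F_j | Y) - I(F_i, F_j), using the sign convention of the text."""
    mi = entropy(fi) + entropy(fj) - entropy(fi, fj)
    cmi = entropy(fi, y) + entropy(fj, y) - entropy(fi, fj, y) - entropy(y)
    return cmi - mi

# Purely synergistic toy case: the target is the XOR of two independent bits,
# so neither feature alone is informative, but the pair determines the target.
fi = np.array([0, 0, 1, 1])
fj = np.array([0, 1, 0, 1])
y = fi ^ fj
print(interaction_information(fi, fj, y))   # positive, equal to ln 2 up to rounding

# Purely redundant toy case: both features are copies of the target.
print(interaction_information(y, y, y))     # negative, equal to -ln 2 up to rounding
```

Under this sign convention a large positive $\tau_{ij}$ indicates synergy, which is why such a score can be used to rank candidate pairs. Continuous features would first need to be discretized (or a different estimator used), which this sketch leaves out.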

Abstract

Automated feature engineering (AutoFE) is the process of automatically building and selecting new features that help improve downstream predictive performance. While traditional feature engineering requires significant domain expertise and time-consuming iterative testing, AutoFE strives to make feature engineering easy and accessible to all data science practitioners. We introduce a new AutoFE algorithm, IIFE, based on determining which feature pairs synergize well through an information-theoretic perspective called interaction information. We demonstrate the superior performance of IIFE over existing algorithms. We also show how interaction information can be used to improve existing AutoFE algorithms. Finally, we highlight several critical experimental setup issues in the existing AutoFE literature and their effects on performance.

Paper Structure (19 sections, 6 figures, 5 tables, 3 algorithms)

This paper contains 19 sections, 6 figures, 5 tables, 3 algorithms.

Introduction
Related Work
Algorithm Description
Experimental Results
Algorithm Comparisons
Experimental Setup
Public Data Results
Results on a large-scale proprietary data set
Experimental Verification of Interaction Information
Issues in AutoFE Literature
Cross-validation scores as performance metric
OpenFE transductive setting
Improving other algorithms with interaction information
Combining AutoFE algorithms
Conclusion
...and 4 more sections

Figures (6)

Figure 1: Flowchart of IIFE using a toy example of three starting features and a few uni- and bivariate functions. In realistic settings the sets $\mathcal{F}$, $\mathcal{B}$, and $\mathcal{U}$ would be larger. The next iteration will include the newly engineered feature $\log(F_1 * F_2)$ in the pool of features, so increasingly complex features are engineered.
Figure 2: Top: Percent improvement of each algorithm over the baseline test score over all public datasets/models. The box represents the interquartile range with the central line being the median. The whiskers extend to the largest value within 1.5 times the interquartile range. The circles represent outliers. Bottom: Average rank for each algorithm over all of the public datasets and runs for all models, linear models, RF models, and LGBM models. Error bars show the 25th and 75th percentiles. The best performing algorithm would be rank 1, so lower average rank number is better.
Figure 3: Plot demonstrating similar improvements between IIFE+Lasso, RF$^*$, and LGBM$^*$ relative to Lasso$^*$/LR$^*$. The $^*$ denotes models trained only on the original features. The error bars show $\pm$ one standard deviation across all 25 runs. This shows that on many datasets, engineering features with IIFE can bring linear models close to the performance of large, complicated nonlinear models such as random forest and LightGBM with large numbers of estimators and deep trees.
Figure 4: The blue bars show the feature importance, the red dots show the order of the feature, and the black line shows the cross-validation score after the feature was added. The faded lines in the background are the cross-validation curves for the other 24 runs with different random seeds. Some of the additional faded curves are truncated to focus the plot on the key iterations. Top: The plot shows a case where forming a large number of highly complex features helps performance. It also depicts the validation score growing as the number of features increases (the first several features of order 1 are the original features). The most important engineered feature is order 10 which is out of the practical complexity range for the majority of AutoFE algorithms. Bottom: The plot shows a case where it is more beneficial to create fewer, low-order engineered features to boost validation scores. This is typically the case for more complicated models such as random forest and LightGBM which can already model complex non-linear behavior.
Figure 5: Histograms of the rank of the true feature pair when computing the interaction information of each feature pair with a synthetic target built from the true pair. This is repeated across all possible feature pairs. We expect the rank to be zero or a low number, so most of the density of the histogram should be on the left side. The synthetic target is constructed as $F_i + F_j$, $F_i F_j$, $\sin(F_i^2 + F_i F_j + F_j^2)$, and $\exp(|\max(F_i,F_j)|)$, which are increasingly complex, nonlinear, and, in the final example, non-smooth.
...and 1 more figure
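The verification experiment described in the Figure 5 caption can be sketched roughly as follows: build a synthetic target from one known feature pair, score every pair by interaction information, and record the rank of the true pair (0 meaning the highest score). This is only a toy version under stated assumptions: the quantile binning, the plug-in entropy estimator, the Gaussian toy data, and the helper names `discretize`, `entropy`, `tau`, and `rank_of_true_pair` are all choices made for this example rather than details taken from the paper.

```python
import itertools
import numpy as np

def discretize(x, bins=5):
    """Map a continuous column to integer quantile-bin labels 0..bins-1."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)

def entropy(*cols):
    """Plug-in joint entropy (in nats) of discrete columns."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def tau(fi, fj, y):
    """I(F_i, F_j | Y) - I(F_i, F_j) for discrete columns."""
    cmi = entropy(fi, y) + entropy(fj, y) - entropy(fi, fj, y) - entropy(y)
    mi = entropy(fi) + entropy(fj) - entropy(fi, fj)
    return cmi - mi

def rank_of_true_pair(X, y, true_pair):
    """Rank (0 = highest tau) of true_pair among all column pairs of discrete X."""
    pairs = list(itertools.combinations(range(X.shape[1]), 2))
    scores = [tau(X[:, i], X[:, j], y) for i, j in pairs]
    order = np.argsort(scores)[::-1]              # pair indices, descending tau
    ranked = [pairs[k] for k in order]
    return ranked.index(tuple(sorted(true_pair)))

# Toy run (assumed setup): five Gaussian features, target built from features 0 and 1.
rng = np.random.default_rng(0)
F = rng.normal(size=(5000, 5))
target = F[:, 0] * F[:, 1]
Xd = np.column_stack([discretize(F[:, k]) for k in range(F.shape[1])])
print(rank_of_true_pair(Xd, discretize(target), (0, 1)))
```

Repeating this for every choice of true pair and for each synthetic target form would give one histogram of ranks per target, which is the quantity the caption describes. Note that the plug-in estimator is biased when bins are sparsely populated, so the bin count and sample size matter in practice.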
