Evolutionary Large Language Model for Automated Feature Transformation

Nanxu Gong; Chandan K. Reddy; Wangyang Ying; Haifeng Chen; Yanjie Fu

TL;DR

The paper tackles Automated Feature Transformation (AFT), whose search space grows combinatorially, by introducing the Evolutionary Large Language Model for Generative Feature Transformation (ELLM-FT). ELLM-FT couples a reinforcement-learning data collector, which builds a diverse multi-population database, with an LLM that generates postfix feature transformation sequences from few-shot prompts; evolutionary database maintenance and verification against downstream performance guide the search. The method is reported to perform competitively across 12 datasets, to be robust to noise, and to generalize across dataset sizes while exploring the feature space efficiently. By blending general LLM knowledge with task-specific feedback through evolutionary strategies, the approach offers a scalable and adaptable search paradigm for feature engineering.
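To make the notion of a postfix feature transformation sequence concrete, the following is a minimal sketch of how such a sequence could be evaluated against one row of raw features. The operator tables, the token format, and the function name `eval_postfix` are illustrative assumptions, not the paper's actual operation set or implementation.

```python
import math

# Hypothetical operator tables (assumption): the paper's real operation set
# is not listed in this summary.
UNARY = {
    "sqrt": lambda x: math.sqrt(abs(x)),
    "log": lambda x: math.log(abs(x) + 1.0),
}
BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else 0.0,
}


def eval_postfix(tokens, row):
    """Evaluate one postfix token sequence against a row of raw features.

    Returns the transformed value, or None if the sequence is malformed
    (unknown token, too few operands, or leftover operands).
    """
    stack = []
    for tok in tokens:
        if tok in row:                      # operand: a raw feature name
            stack.append(row[tok])
        elif tok in UNARY:
            if len(stack) < 1:
                return None
            stack.append(UNARY[tok](stack.pop()))
        elif tok in BINARY:
            if len(stack) < 2:
                return None
            b = stack.pop()
            a = stack.pop()
            stack.append(BINARY[tok](a, b))
        else:
            return None                     # unknown token
    return stack[0] if len(stack) == 1 else None


row = {"f1": 4.0, "f2": 2.0, "f3": 3.0}
print(eval_postfix(["f1", "f2", "+", "f3", "*"], row))  # (f1 + f2) * f3
print(eval_postfix(["f1", "+"], row))                   # malformed sequence
```

A validity check of this kind is one plausible way to filter LLM-generated sequences before they are scored on the downstream task, since postfix notation needs no parentheses and malformed outputs are cheap to detect.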

Abstract

Feature transformation aims to reconstruct the feature space of raw features to enhance the performance of downstream models. However, the exponential growth in the combinations of features and operations poses a challenge, making it difficult for existing methods to efficiently explore a wide space. Additionally, their optimization is solely driven by the accuracy of downstream models in specific domains, neglecting the acquisition of general feature knowledge. To fill this research gap, we propose an evolutionary LLM framework for automated feature transformation. This framework consists of two parts: 1) constructing a multi-population database through an RL data collector while utilizing evolutionary algorithm strategies for database maintenance, and 2) leveraging the sequence-understanding ability of Large Language Models (LLMs) by employing few-shot prompts that guide the LLM to generate superior samples based on the distinctions among feature transformation sequences. The multi-population database initially provides a wide search scope for discovering excellent populations. Through culling and evolution, high-quality populations are afforded greater opportunities, thereby furthering the pursuit of optimal individuals. By integrating LLMs with evolutionary algorithms, we achieve efficient exploration within a vast space while harnessing feature knowledge to propel optimization, thus realizing a more adaptable search paradigm. Finally, we empirically demonstrate the effectiveness and generality of our proposed method.
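The multi-population maintenance described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class name `MultiPopulationDB`, the rule that the weaker half of the populations is overwritten by copies of the stronger half, and the choice to order few-shot examples from worst to best score. The abstract does not specify these details.

```python
import random


class MultiPopulationDB:
    """Toy multi-population store of (score, sequence) pairs (hypothetical)."""

    def __init__(self, n_populations, capacity):
        self.populations = [[] for _ in range(n_populations)]
        self.capacity = capacity

    def add(self, pop_idx, sequence, score):
        # Keep each population sorted by score and truncated to capacity.
        pop = self.populations[pop_idx]
        pop.append((score, sequence))
        pop.sort(key=lambda item: item[0], reverse=True)
        del pop[self.capacity:]

    def best_score(self, pop_idx):
        pop = self.populations[pop_idx]
        return pop[0][0] if pop else float("-inf")

    def cull(self):
        # Assumed culling rule: overwrite the weaker half of the populations
        # with copies of the stronger half, ranked by best score.
        order = sorted(range(len(self.populations)),
                       key=self.best_score, reverse=True)
        half = len(order) // 2
        survivors = order[:half]
        culled = order[len(order) - half:]
        for weak, strong in zip(culled, survivors):
            self.populations[weak] = list(self.populations[strong])

    def sample_few_shot(self, pop_idx, k, rng):
        # Draw up to k examples and order them from worst to best, so a
        # prompt built from them ends on the strongest sequence (assumption).
        pop = self.populations[pop_idx]
        shots = rng.sample(pop, min(k, len(pop)))
        return sorted(shots, key=lambda item: item[0])


db = MultiPopulationDB(n_populations=2, capacity=2)
db.add(0, "f1 f2 +", 0.5)
db.add(0, "f1 f2 *", 0.9)
db.add(0, "f1 sqrt", 0.7)
db.add(1, "f3 log", 0.1)
db.cull()
print(db.populations[1])
print(db.sample_few_shot(0, 2, random.Random(0)))
```

In a full loop, each sequence proposed by the LLM would be scored on the downstream task and passed to `add`, with `cull` called periodically so that stronger populations receive more of the search budget.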

Paper Structure (17 sections, 6 figures, 2 tables, 1 algorithm)

Introduction
Problem Statement
Evolutionary LLM For Generative Feature Transformation
Overview of the ELLM-FT Framework
Reinforcement Multi-Population Training Database Construction
LLM-based Feature Transformation Operation Sequence Generation
Experimental Results
Experimental Setup
Overall Comparisons
Examining The Performance Trajectory of Iterative Explorations
A Study of Prompt Design
Examining the Impact of RL Data Collector
Robustness Check
Related work
Automated Feature Transformation
...and 2 more sections

Figures (6)

Figure 1: An example of feature transformation sequence. $s_i$ denotes the feature sequence.
Figure 2: Framework overview. First, we utilize the RL data collector to construct the database. Then, we leverage a pre-trained LLM to iteratively generate new feature transformation sequences while simultaneously updating the database.
Figure 3: An example of our prompt consisting of the instruction and few-shot feature transformation operation sequence samples.
Figure 4: The performance trajectory of the proposed method compared to GRFG on two datasets.
Figure 5: Results of the proposed method using three different prompts. We compared (a) downstream task accuracy across two datasets, and (b) valid sample numbers.
...and 1 more figures
