Dynamic and Adaptive Feature Generation with LLM

Xinhao Zhang, Jinghan Zhang, Banafsheh Rekabdar, Yuanchun Zhou, Pengfei Wang, Kunpeng Liu

TL;DR

This work addresses the challenge of feature generation by introducing LFG, a dynamic, adaptable, and interpretable framework that leverages large language models (LLMs) and Tree of Thoughts (ToT) to generate and refine feature spaces. The method uses multiple expert LLM agents to create new features via a predefined operation set, evaluates their impact on downstream tasks, and iteratively improves strategies through feedback and Monte Carlo Tree Search (MCTS). Key contributions include an end-to-end LLM-driven feature-generation pipeline, explicit interpretability of operations, and demonstrated superiority over baselines across several downstream classification tasks. The approach promises enhanced applicability to varied data types and tasks, offering greater flexibility and potential for broader deployment in automated feature engineering.

Abstract

The feature space is the environment in which data points are vectorized and embedded for subsequent modeling, so the efficacy of machine learning (ML) algorithms is closely tied to the quality of feature engineering. As one of the most important feature engineering techniques, feature generation transforms raw data into an optimized feature space conducive to model training and then further refines that space. Despite advances in automated feature engineering and feature generation, current methods often suffer from three fundamental issues: lack of explainability, limited applicability, and inflexible strategy. These shortcomings frequently limit the deployment of ML models across varied scenarios. Our research introduces a novel approach that adopts large language models (LLMs) and feature-generating prompts to address these challenges. We propose a dynamic and adaptive feature generation method that enhances the interpretability of the feature generation process. Our approach broadens applicability across various data types and tasks and offers advantages in strategic flexibility. Extensive experiments show that our approach significantly outperforms existing methods.

Paper Structure (33 sections, 9 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Our goal is to iteratively reconstruct the feature space to create an optimal and explainable feature set that enhances performance on downstream machine learning tasks.
  • Figure 2: The framework of LFG. First, we input the original feature set and operation set into the LLM's context window. Second, we guide the LLM in creating three expert agents. Each agent generates new features with operations from the operation set, and then combines new features with original features to create a new feature subset. Then, each of these subsets is individually evaluated on a downstream task. After that, we provide the performance of each feature subset in downstream tasks as feedback to the respective agents and iterate the process until the best feature subset is found or the maximum number of iterations is reached.
  • Figure 3: The framework of MCTS for feature generation includes five stages: 1) Performance Evaluation, 2) Node Selection, 3) Node Expansion, 4) Node Generation, 5) Optimal Subset Searching.
  • Figure 4: Comparison on KNN (Accuracy and F1).
  • Figure 5: Increase of feature numbers.
  • ...and 1 more figure
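
The loop described in the Figure 2 caption (agents generate features from an operation set, each new feature subset is scored on a downstream task, and the score is fed back until an iteration limit is reached) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the LLM agents are replaced by a random proposer (`propose`), the operation names in `UNARY` and `BINARY` are hypothetical, the downstream task is a leave-one-out 1-nearest-neighbour classifier chosen only because it needs no external library, and the MCTS search of Figure 3 is not modelled.

```python
import math
import random

# Hypothetical operation set; the paper's actual set is not listed in this section.
UNARY = {
    "square": lambda x: x * x,
    "log1p_abs": lambda x: math.log1p(abs(x)),
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (stand-in downstream task)."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(xi, xj))
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)

def propose(rng, n_cols):
    """Stand-in for an LLM agent: pick an operation and its input columns at random."""
    if rng.random() < 0.5:
        return rng.choice(sorted(UNARY)), (rng.randrange(n_cols),)
    return rng.choice(sorted(BINARY)), (rng.randrange(n_cols), rng.randrange(n_cols))

def apply_op(X, name, cols):
    """Append the generated feature as a new column and return the new feature subset."""
    if len(cols) == 1:
        return [row + [UNARY[name](row[cols[0]])] for row in X]
    return [row + [BINARY[name](row[cols[0]], row[cols[1]])] for row in X]

def generate_features(X, y, n_agents=3, max_iters=10, seed=0):
    rng = random.Random(seed)
    base = loo_1nn_accuracy(X, y)
    # Each agent keeps its own feature subset, its score, and a readable log of operations.
    agents = [{"X": X, "score": base, "log": []} for _ in range(n_agents)]
    for _ in range(max_iters):
        for agent in agents:
            name, cols = propose(rng, len(agent["X"][0]))
            candidate = apply_op(agent["X"], name, cols)
            score = loo_1nn_accuracy(candidate, y)
            # Feedback step: keep the new subset only if the downstream score improves.
            if score > agent["score"]:
                agent.update(X=candidate, score=score, log=agent["log"] + [(name, cols)])
    return max(agents, key=lambda a: a["score"]), base
```

Calling `generate_features(X, y)` on a list of numeric rows `X` and labels `y` returns the best agent's state together with the baseline score. The `log` entry records which operation produced each added column, a simple analogue of the explicit, interpretable operations the framework emphasizes.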