Table of Contents
Fetching ...

FeRG-LLM : Feature Engineering by Reason Generation Large Language Models

Jeonghyun Ko, Gyeongyun Park, Donghoon Lee, Kyunam Lee

TL;DR

FeRG-LLM tackles the labor-intensive problem of feature engineering for tabular data by training an 8B-scale LLM (Llama 3.1) with two-stage dialogue and Chain-of-Thought reasoning, then aligning it with Direct Preference Optimization to refine feature-generation rationales. The framework supports local deployment (no cloud API dependence) and, despite its smaller size, matches or surpasses a 70B baseline on most classification tasks and extends effectively to regression, with faster inference. Key innovations include CoT-enabled two-stage dialogue for autonomous feature discovery, LoRA-based SFT, and DPO-driven alignment, all evaluated across 14 binary classification datasets and several regression tasks. The results indicate strong practical value for enterprises with limited resources, offering automated code generation for feature creation and improved data security, while ablations confirm the benefits of rationale generation and DPO alignment.

Abstract

One of the key tasks in machine learning for tabular data is feature engineering. Although it is vital for improving the performance of models, it demands considerable human expertise and deep domain knowledge, making it labor-intensive endeavor. To address this issue, we propose a novel framework, \textbf{FeRG-LLM} (\textbf{Fe}ature engineering by \textbf{R}eason \textbf{G}eneration \textbf{L}arge \textbf{L}anguage \textbf{M}odels), a large language model designed to automatically perform feature engineering at an 8-billion-parameter scale. We have constructed two-stage conversational dialogues that enable language models to analyze machine learning tasks and discovering new features, exhibiting their Chain-of-Thought (CoT) capabilities. We use these dialogues to fine-tune Llama 3.1 8B model and integrate Direct Preference Optimization (DPO) to receive feedback improving quality of new features and the model's performance. Our experiments show that FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets, while using fewer resources and achieving reduced inference time. It outperforms other studies in classification tasks and performs well in regression tasks. Moreover, since it does not rely on cloud-hosted LLMs like GPT-4 with extra API costs when generating features, it can be deployed locally, addressing security concerns.

FeRG-LLM : Feature Engineering by Reason Generation Large Language Models

TL;DR

FeRG-LLM tackles the labor-intensive problem of feature engineering for tabular data by training an 8B-scale LLM (Llama 3.1) with two-stage dialogue and Chain-of-Thought reasoning, then aligning it with Direct Preference Optimization to refine feature-generation rationales. The framework supports local deployment (no cloud API dependence) and, despite its smaller size, matches or surpasses a 70B baseline on most classification tasks and extends effectively to regression, with faster inference. Key innovations include CoT-enabled two-stage dialogue for autonomous feature discovery, LoRA-based SFT, and DPO-driven alignment, all evaluated across 14 binary classification datasets and several regression tasks. The results indicate strong practical value for enterprises with limited resources, offering automated code generation for feature creation and improved data security, while ablations confirm the benefits of rationale generation and DPO alignment.

Abstract

One of the key tasks in machine learning for tabular data is feature engineering. Although it is vital for improving the performance of models, it demands considerable human expertise and deep domain knowledge, making it labor-intensive endeavor. To address this issue, we propose a novel framework, \textbf{FeRG-LLM} (\textbf{Fe}ature engineering by \textbf{R}eason \textbf{G}eneration \textbf{L}arge \textbf{L}anguage \textbf{M}odels), a large language model designed to automatically perform feature engineering at an 8-billion-parameter scale. We have constructed two-stage conversational dialogues that enable language models to analyze machine learning tasks and discovering new features, exhibiting their Chain-of-Thought (CoT) capabilities. We use these dialogues to fine-tune Llama 3.1 8B model and integrate Direct Preference Optimization (DPO) to receive feedback improving quality of new features and the model's performance. Our experiments show that FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets, while using fewer resources and achieving reduced inference time. It outperforms other studies in classification tasks and performs well in regression tasks. Moreover, since it does not rely on cloud-hosted LLMs like GPT-4 with extra API costs when generating features, it can be deployed locally, addressing security concerns.

Paper Structure

This paper contains 36 sections, 3 equations, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overall framework of FeRG-LLM. The method first generates a two-stage dialogue related to feature discovery and then performs SFT using LoRA. It then optimizes the language model to provide reasoning feedback.
  • Figure 2: Example of two-stage dialogue that facilitates the LLM's reasoning process. The first step involves conceptualizing core ideas, followed by the second step, where these ideas are actualized through the creation of Python code.
  • Figure 3: t-SNE Visualization of Reasoning Embeddings. Yellow points represent embeddings of reasoning generated by FeRG-LLM, while blue points correspond to those generated without DPO.
  • Figure 4: Inference time comparison of the 70B model and FeRG-LLM using different configurations. The 70B model takes 83.71 seconds on four RTX A6000 and 34.10 seconds on two A100, while FeRG-LLM completes inference in 16.4 seconds on a single RTX A6000 and local feature generation in about 5 seconds on a single RTX A6000.
  • Figure 5: Data Generation Prompt Using the GPT API.