Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models

Yebin Lim, Susik Yoon

TL;DR

This paper tackles the problem of robustness in LLM-driven feature engineering for tabular data by introducing a multi-level diagnosis and evaluation framework. It defines three core elements—Golden Variable, Golden Relation, and Golden Value—and provides reliability scores (RS1–RS3) and feature scores (FS1–FS3) to quantify LLM consistency across variable importance, variable–class relations, and decision boundaries. Through experiments on eight benchmark datasets and multiple LLMs, the study shows that LLM robustness varies significantly by dataset and input conditions, and that high-quality, diagnostically selected features can improve few-shot prediction performance by up to $10.52\%$ AUROC over baselines. The framework thus offers a principled approach to assessing and enhancing the reliability of LLM-driven feature engineering in diverse domains, with practical implications for deploying LLMs in data science workflows.

Abstract

Recent advancements in large language models (LLMs) have shown promise in feature engineering for tabular data, but concerns about their reliability persist, especially due to variability in generated outputs. We introduce a multi-level diagnosis and evaluation framework to assess the robustness of LLMs in feature engineering across diverse domains, focusing on three main factors: key variables, relationships, and decision boundary values for predicting target classes. We demonstrate that the robustness of LLMs varies significantly across datasets, and that high-quality LLM-generated features can improve few-shot prediction performance by up to 10.52%. This work opens a new direction for assessing and enhancing the reliability of LLM-driven feature engineering in various domains.

Paper Structure

This paper contains 54 sections, 16 equations, 25 figures, 3 tables.

Figures (25)

  • Figure 1: LLMs vulnerable to generating features of varying quality (left). Measures for high-quality features leading to performance improvement (right).
  • Figure 2: Overall procedure of our framework involves a multi-level scheme of variables, relations, and values to diagnose reliability and evaluate features generated by LLMs in feature engineering on different domains and inputs.
  • Figure 3: Variation of reliability scores (averaged over three levels) for different LLMs and datasets.
  • Figure 4: The change of variance and bias of reliability score with varying input for GPT-3.5-Turbo.
  • Figure 5: Variation of reliability scores of each level for different models and datasets.
  • ...and 20 more figures