LLM-Select: Feature Selection with Large Language Models

Daniel P. Jeong, Zachary C. Lipton, Pradeep Ravikumar

TL;DR

LLM-Select shows that prompting large language models to assess feature relevance from only feature names and a target description can yield effective feature selection without accessing downstream training data. The authors introduce three prompting strategies (LLM-Score, LLM-Rank, and LLM-Seq) and systematically evaluate them across zero-shot and context-rich prompts using models from GPT-4 to Llama-2. Across small- and large-scale real-world datasets, GPT-4-based LLM-Score often matches or surpasses traditional baselines such as LASSO, which suggests it could inform not only which features to train on but also which features to collect in high-stakes domains like healthcare. The work also analyzes prompt design, decoding strategies, and the relationship between LLM-derived scores and conventional feature importance metrics, finding that larger models align more closely with standard notions of importance. Limitations include dependence on text semantics and potential biases, suggesting a hybrid or human-in-the-loop approach for practical deployment.

Abstract

In this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., "blood pressure") in predicting an outcome of interest (e.g., "heart failure"), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could benefit practitioners in domains like healthcare and the social sciences, where collecting high-quality data comes at a high cost.
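The zero-shot scoring query described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the [0, 1] score range, the helper names (`build_score_prompt`, `parse_score`, `select_top_features`, `llm_score_select`), the rounding rule for the selected fraction, and the `stub_llm` stand-in with its canned replies are all hypothetical. In practice `ask_llm` would wrap a real model call.

```python
import math
import re
from typing import Callable, Dict, List


def build_score_prompt(feature: str, target: str) -> str:
    """Zero-shot prompt asking for a single importance score.

    The wording is illustrative only; it is not the prompt used in the paper.
    """
    return (
        f'On a scale from 0 to 1, how important is the feature "{feature}" '
        f'for predicting "{target}"? Answer with a single number.'
    )


def parse_score(reply: str) -> float:
    """Extract the first number in the reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"no numeric score in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def select_top_features(scores: Dict[str, float], fraction: float = 0.3) -> List[str]:
    """Keep the highest-scoring `fraction` of features (at least one).

    Rounding up is an assumption made for this sketch.
    """
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def llm_score_select(
    features: List[str],
    target: str,
    ask_llm: Callable[[str], str],
    fraction: float = 0.3,
) -> List[str]:
    """Score each feature with one query, then keep the top fraction.

    Only feature names and the target description are used; no training data.
    """
    scores = {
        feature: parse_score(ask_llm(build_score_prompt(feature, target)))
        for feature in features
    }
    return select_top_features(scores, fraction)


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call, with made-up canned replies."""
    canned = {
        "blood pressure": "0.9",
        "age": "0.7",
        "zip code": "0.1",
        "favorite color": "0.05",
    }
    for name, reply in canned.items():
        if f'"{name}"' in prompt:
            return reply
    return "0.0"


features = ["favorite color", "blood pressure", "zip code", "age"]
selected = llm_score_select(features, "heart failure", stub_llm, fraction=0.5)
print(selected)
```

Because each feature is scored in its own query, the selection step reduces to sorting the parsed scores, which is what makes the approach usable before any data has been collected.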

Paper Structure (114 sections, 3 equations, 20 figures, 5 tables)


Figures (20)

  • Figure 1: Selecting features by zero-shot prompting an LLM leads to strong downstream predictive performance, competitive with data-driven feature selection methods. (a) Overview of our proposed LLM-Score, LLM-Rank, and LLM-Seq methods (Section \ref{sec:llm-fs}). (b) Average test AUROC (higher is better) on classification datasets when selecting the top 30% of features according to the best-performing data-driven baseline on each dataset (in red), LLM-Score based on GPT-4 (in blue), and a random feature selection baseline (in black). Error bars indicate standard error across datasets in each group.
  • Figure 2: LLM-Score shows competitive feature selection performance against data-driven baselines, given an LLM of sufficient scale. (a) Average AUROC (left; higher is better) and ranking by MAE (right; lower is better) across all datasets when selecting the top 30% of features. (b) Feature selection paths for LLM-Score (GPT-4), the best-performing baseline, and random selection on datasets published after the LLM cutoff dates. (c) Feature selection paths for LLM-Score on the same datasets, with varying LLM scale.
  • Figure 3: Feature selection paths for LLM-Score, LLM-Rank, and LLM-Seq based on (a) GPT-4 and (b) GPT-3.5 on all classification and regression datasets. Within each panel, the top row shows the results on the classification datasets, and the bottom row shows the results on the regression datasets. GPT-4-based methods all show consistently strong performance across datasets, with substantial overlap in their feature selection paths. GPT-3.5-based methods show similar trends, albeit less pronounced. Datasets marked with an asterisk (*) were published after the LLM cutoff dates.
  • Figure 4: Changes in average improvement (%) in LLM-Score feature selection performance as we vary the decoding strategy ($T=0$: greedy, $T=0.5$: self-consistency) and prompt design (in parentheses), compared to the performance achieved under the default prompting setup (in bold; see Section \ref{sec:exp}). Error bars indicate standard error across datasets. On average, no approach substantially improves over the default setting.
  • Figure 5: Average rank correlation (Kendall's $\tau$) between each feature importance metric and LLM-Score based on GPT-4, GPT-3.5, and Llama-2. Error bars indicate standard error across datasets. LLM-Score generally exhibits higher rank correlation with standard feature importance metrics as model scale increases, but does not uniquely align to a specific notion of feature importance.
  • ...and 15 more figures
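
The rank correlation reported in Figure 5 can be illustrated with a small, dependency-free sketch. It computes the tau-a variant of Kendall's $\tau$, in which tied pairs count as neither concordant nor discordant; which variant the paper uses is not stated here, so that choice is an assumption. The function name `kendall_tau_a` and the two score lists are made up for illustration and are not results from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both sequences order items i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; tau-a ignores it in the numerator.

    return (concordant - discordant) / (n * (n - 1) / 2)


# Illustrative values only: one LLM-derived score and one data-driven
# importance value per feature, listed in the same feature order.
llm_scores = [0.9, 0.7, 0.1, 0.05]
data_importance = [0.40, 0.35, 0.05, 0.20]

tau = kendall_tau_a(llm_scores, data_importance)
print(f"Kendall tau-a: {tau:.3f}")
```

A value near 1 means the two methods order the features almost identically, a value near 0 means the orderings are unrelated, and a value near -1 means they are reversed.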