Retrieval-Augmented Feature Generation for Domain-Specific Classification

Xinhao Zhang; Jinghan Zhang; Fengran Mo; Dakshak Keerthi Chandra; Yu-Zhong Chen; Fei Xie; Kunpeng Liu

TL;DR

This work addresses data scarcity and limited interpretability in domain-specific classification by introducing RAFG, a training-free framework that uses retrieval-augmented generation and LLM-based reasoning to create meaningful domain-specific features. RAFG retrieves external knowledge, reasons over it to transform existing features into new interpretable indicators, and validates each new feature iteratively against downstream task performance. Across four real-world datasets spanning medical, economic, and geographic domains, RAFG outperforms baselines, increases information gain, and produces features with clear domain semantics. Because these gains require no domain-specific fine-tuning, the approach can be adapted to other specialized classification tasks.

Abstract

Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is to expand the current feature space by deriving new features from existing ones, thereby enriching its informational content. However, generating new, interpretable features usually requires domain-specific knowledge on top of the existing features. In this paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to generate useful and explainable features for domain-specific classification tasks. To increase the interpretability of the generated features, we conduct knowledge retrieval among the existing features in the domain to identify potential feature associations, which then guide the generation of useful features. Moreover, we develop a framework based on large language models (LLMs) that generates features with reasoning and verifies their quality during the generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method can produce high-quality, meaningful features and significantly improve classification performance compared with baseline methods.

Paper Structure (16 sections, 9 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Feature Generation
Retrieval-Augmented Machine Learning
Problem Statement
Methodology
Query Generation and Domain Knowledge Retrieval
Retrieval-Augmented Feature Generation with Reasoning
Feature Content Validation by LLM
Experiments
Experimental Settings
Experimental Results
Generated Features and Interpretability
Performance and Information Gain Evolution
Correlation
...and 1 more sections

Figures (6)

Figure 1: Framework of RAFG. We adopt an LLM to generate new features from retrieved textual information that contains expert knowledge (e.g., BMI, as in the case shown).
Figure 2: Framework of RAFG. Given an input data table comprising a description, feature vectors, and a target label vector, the LLM first combines the description text, label information, and data types into an embedded query. With this query, we use RAG to search an external library for one or several relevant documents, which guide the LLM in creating the new feature with the most potential. We then evaluate the data table augmented with the new feature for metric improvement, and the LLM decides whether to keep the feature. This search-and-generation process iterates until the maximum number of rounds is reached or the best feature space is found.
Figure 3: Accuracy of using RF with RAFG and various LLMs.
Figure 4: Information gain with RAFG across different datasets.
Figure 5: Accuracy variation during the RAFG feature generation process and the information gain for the GCI dataset with the DT model.
...and 1 more figures
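
The loop described in the Figure 2 caption (build a query, retrieve documents, generate a candidate feature, keep it only if a metric improves) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything named here is hypothetical: `LIBRARY` and `retrieve` stand in for the external library and the RAG search, `propose` is a hard-coded rule standing in for the LLM's reasoning, `stump_accuracy` (the best one-feature threshold rule) stands in for the downstream classifier, and the keep-or-discard decision is a plain score comparison rather than an LLM judgment. The toy data is invented for illustration only.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Row = Dict[str, float]
Candidate = Tuple[str, Callable[[Row], float]]


def stump_accuracy(rows: Sequence[Row], labels: Sequence[int],
                   features: Sequence[str]) -> float:
    """Best training accuracy of any one-feature threshold rule (a depth-1 tree)."""
    n = len(rows)
    best = 0.0
    for f in features:
        for t in {r[f] for r in rows}:
            # Rule: predict class 1 when the feature value is >= t.
            hits = sum((r[f] >= t) == (y == 1) for r, y in zip(rows, labels))
            best = max(best, hits / n, (n - hits) / n)  # the rule and its inverse
    return best


# Hypothetical external library; a real system would query a document index.
LIBRARY = [
    "Body mass index (BMI) is weight in kilograms divided by height in metres squared.",
    "Gross domestic product per capita is GDP divided by population.",
]


def _words(text: str) -> set:
    return {w.strip(",.:()").lower() for w in text.split()}


def retrieve(query: str) -> List[str]:
    """Toy retrieval: keep documents that share at least one word with the query."""
    return [doc for doc in LIBRARY if _words(query) & _words(doc)]


def propose(docs: Sequence[str], features: Sequence[str]) -> Optional[Candidate]:
    """Toy generator: a fixed rule standing in for LLM reasoning over the documents."""
    mentions_bmi = any("BMI" in doc for doc in docs)
    if mentions_bmi and {"height_m", "weight_kg"} <= set(features) and "bmi" not in features:
        return "bmi", lambda r: r["weight_kg"] / r["height_m"] ** 2
    return None


def generate_features(rows: Sequence[Row], labels: Sequence[int],
                      features: Sequence[str], description: str,
                      max_rounds: int = 5) -> Tuple[List[str], float]:
    """Iterate query -> retrieve -> propose -> validate, keeping only helpful features."""
    rows = [dict(r) for r in rows]          # work on a copy of the table
    features = list(features)
    best = stump_accuracy(rows, labels, features)
    for _ in range(max_rounds):
        query = f"{description} Features: {', '.join(features)}"
        candidate = propose(retrieve(query), features)
        if candidate is None:               # nothing left to try
            break
        name, fn = candidate
        for r in rows:
            r[name] = fn(r)
        score = stump_accuracy(rows, labels, features + [name])
        if score > best:                    # keep the feature only if the metric improves
            features.append(name)
            best = score
        else:
            for r in rows:
                del r[name]
    return features, best


# Invented toy data: the label depends on BMI, not on height or weight alone.
ROWS = [
    {"height_m": 1.60, "weight_kg": 70.0},
    {"height_m": 1.60, "weight_kg": 55.0},
    {"height_m": 1.80, "weight_kg": 95.0},
    {"height_m": 1.80, "weight_kg": 75.0},
    {"height_m": 1.70, "weight_kg": 80.0},
    {"height_m": 1.90, "weight_kg": 85.0},
]
LABELS = [1, 0, 1, 0, 1, 0]

baseline = stump_accuracy(ROWS, LABELS, ["height_m", "weight_kg"])
final_features, final_score = generate_features(
    ROWS, LABELS, ["height_m", "weight_kg"],
    description="Predict obesity risk from height and weight.",
)
print(baseline, final_features, final_score)
```

On this toy table neither raw feature separates the classes with a single threshold, while the derived ratio does, so the candidate is kept. In a real setting the scorer would be a held-out evaluation of the downstream model rather than training accuracy of a threshold rule.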
