Retrieval-Augmented Feature Generation for Domain-Specific Classification

Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dakshak Keerthi Chandra, Yu-Zhong Chen, Fei Xie, Kunpeng Liu

TL;DR

This work addresses data scarcity and limited interpretability in domain-specific classification by introducing RAFG, a training-free framework that uses retrieval-augmented generation and LLM-based reasoning to create meaningful domain-specific features. RAFG retrieves external knowledge and reasons over it to transform existing features, expanding the feature space with interpretable indicators that it validates iteratively against downstream task performance. Across four real-world datasets spanning medical, economic, and geographic domains, RAFG outperforms baselines, increases information gain, and produces interpretable features with clear domain semantics. These gains come without domain-specific fine-tuning, which suggests the approach can be adapted to a broad range of specialized classification tasks.

Abstract

Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to generate features is to expand the current feature space by transforming existing features and enriching their informational content. However, generating new, interpretable features usually requires domain-specific knowledge on top of the existing features. In this paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to generate useful and explainable features for domain-specific classification tasks. To increase the interpretability of the generated features, we conduct knowledge retrieval among the existing features in the domain to identify potential feature associations, which are expected to help generate useful features. Moreover, we develop a framework based on large language models (LLMs) for feature generation with reasoning, which verifies the quality of the features during the generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method can produce high-quality, meaningful features and significantly improve classification performance compared with baseline methods.

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Framework of RAFG. We adopt an LLM to generate new features from retrieved textual information that contains expert knowledge (e.g., BMI in the case shown).
  • Figure 2: Framework of RAFG. Given an input data table comprising a description, feature vectors, and a target label vector, the LLM first integrates the description text, label information, and data types into an embedded query. With this query, we use RAG to search an external library for one or several relevant documents that can guide the LLM in creating the new feature with the most potential. We then test the template data table with the new feature for metric improvement, and the LLM decides whether to retain the feature. This search-and-generation process iterates until the maximum number of rounds is reached or the best feature space is found.
  • Figure 3: Accuracy of using RF with RAFG and various LLMs.
  • Figure 4: Information gain with RAFG across different datasets.
  • Figure 5: Accuracy variation during the RAFG feature generation process and the information gain for the GCI dataset with the DT model.
  • ...and 1 more figures
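
The loop described in the Figure 2 caption (form a query, retrieve a document, derive a feature, test it, keep or discard it, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the two-entry `LIBRARY` stands in for the external library, keyword overlap stands in for embedding-based retrieval, the hard-coded lambdas stand in for the LLM's feature proposal, a one-feature threshold rule (`stump_accuracy`) stands in for the downstream classifier, and a strict accuracy improvement replaces the LLM's keep-or-discard decision. The toy data and the names `rafg_sketch`, `retrieve`, `ROWS`, and `LABELS` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Row = Dict[str, float]
Doc = Tuple[str, str, Callable[[Row], float]]

# Hypothetical external library. Each entry pairs a text snippet with the
# transformation an LLM might derive from it (hard-coded here as a stand-in).
LIBRARY: List[Doc] = [
    ("body mass index is weight divided by height squared",
     "bmi", lambda r: r["weight"] / r["height"] ** 2),
    ("the sum of weight and height has no clinical meaning",
     "weight_plus_height", lambda r: r["weight"] + r["height"]),
]

def retrieve(query: str, tried: Set[str]) -> Optional[Doc]:
    """Return the untried document sharing the most words with the query."""
    words = set(query.lower().split())
    candidates = [d for d in LIBRARY if d[1] not in tried]
    if not candidates:
        return None
    return max(candidates, key=lambda d: len(words & set(d[0].split())))

def stump_accuracy(rows: List[Row], labels: List[int], features: List[str]) -> float:
    """Best training accuracy of any one-feature threshold rule (toy model)."""
    best = 0.0
    for f in features:
        for t in sorted({r[f] for r in rows}):
            for sign in (1, -1):  # try both "value >= t" and "value <= t"
                preds = [int(sign * r[f] >= sign * t) for r in rows]
                acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
                best = max(best, acc)
    return best

def rafg_sketch(rows: List[Row], labels: List[int], description: str,
                max_rounds: int = 5) -> Tuple[List[str], float, float]:
    """Iteratively retrieve, generate, and validate candidate features."""
    rows = [dict(r) for r in rows]  # work on a copy of the table
    features = list(rows[0].keys())
    baseline = best = stump_accuracy(rows, labels, features)
    kept: List[str] = []
    tried: Set[str] = set()
    for _ in range(max_rounds):
        # Query built from the task description and current feature names.
        query = description + " " + " ".join(features)
        doc = retrieve(query, tried)
        if doc is None:
            break  # library exhausted
        _, name, transform = doc
        tried.add(name)
        for r in rows:
            r[name] = transform(r)
        acc = stump_accuracy(rows, labels, features + [name])
        if acc > best:  # keep only features that improve the metric
            best = acc
            features.append(name)
            kept.append(name)
        else:
            for r in rows:
                del r[name]
    return kept, baseline, best

# Toy table: the label marks rows whose BMI is at least 25.
ROWS: List[Row] = [
    {"weight": 60, "height": 1.50}, {"weight": 80, "height": 1.90},
    {"weight": 70, "height": 1.60}, {"weight": 90, "height": 2.00},
    {"weight": 50, "height": 1.40}, {"weight": 75, "height": 1.80},
    {"weight": 85, "height": 1.75}, {"weight": 55, "height": 1.65},
]
LABELS: List[int] = [1, 0, 1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    kept, baseline, best = rafg_sketch(ROWS, LABELS, "predict overweight")
    print(kept, baseline, best)
```

On this toy table, neither raw feature separates the classes with a single threshold (the best rule reaches 0.875), the retrieved BMI transformation separates them exactly (1.0), and the uninformative sum is discarded because it brings no further improvement. A real pipeline would replace each stand-in with the corresponding component: an embedding retriever, an LLM call, and a held-out evaluation of the downstream classifier.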