Retrieval-Augmented Feature Generation for Domain-Specific Classification
Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dakshak Keerthi Chandra, Yu-Zhong Chen, Fei Xie, Kunpeng Liu
TL;DR
This work addresses data scarcity and limited feature interpretability in domain-specific classification by introducing RAFG, a training-free framework that uses retrieval-augmented generation and LLM-based reasoning to create domain-specific features. RAFG retrieves external knowledge, reasons over it to transform existing features into new interpretable indicators, and validates them iteratively against downstream task performance. Across four real-world datasets spanning medical, economic, and geographic domains, RAFG outperforms baselines, increases information gain, and produces features with clear domain semantics. These gains are achieved without domain-specific fine-tuning, which suggests the approach can be adapted to other specialized classification tasks.
Abstract
Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is to expand the current feature space by deriving new features from existing ones, thereby enriching its informational content. However, generating new, interpretable features usually requires domain-specific knowledge in addition to the existing features. In this paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to generate useful and explainable features for domain-specific classification tasks. To increase the interpretability of the generated features, we retrieve domain knowledge about the existing features to identify potential associations among them. These associations are expected to help generate useful features. Moreover, we develop a framework based on large language models (LLMs) that generates features with reasoning, so that the quality of each feature is verified during the generation process. Experiments across several datasets in medical, economic, and geographic domains show that RAFG produces high-quality, meaningful features and significantly improves classification performance compared with baseline methods.
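The generate-then-validate idea can be illustrated with a minimal sketch. This is not the authors' implementation: the retrieval and LLM steps are replaced by two hard-coded, hypothetical candidate transformations, the toy table and all function names are invented for illustration, and information gain on a median split is used as a cheap stand-in for the downstream task score.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of a discrete feature with respect to the labels."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def binarize(column):
    """Split a numeric column at its upper median so it becomes discrete."""
    threshold = sorted(column)[len(column) // 2]
    return [x >= threshold for x in column]


def validate_candidates(table, labels, candidates, min_gain=0.05):
    """Keep a candidate only if it beats the best score so far by min_gain."""
    best = max(information_gain(binarize(col), labels) for col in table.values())
    accepted = {}
    for name, build in candidates.items():
        score = information_gain(binarize(build(table)), labels)
        if score > best + min_gain:
            accepted[name] = score
            best = score
    return accepted


# Hypothetical toy data: the label marks a body-mass index of at least 25.
table = {
    "weight_kg": [50, 65, 60, 80, 70, 95, 55, 75],
    "height_m": [1.5, 1.5, 1.7, 1.7, 1.9, 1.9, 1.6, 1.6],
}
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Assumption: in RAFG these would be proposed by an LLM from retrieved
# domain knowledge; here they are written by hand.
candidates = {
    "bmi": lambda t: [w / h ** 2 for w, h in zip(t["weight_kg"], t["height_m"])],
    "weight_plus_height": lambda t: [w + h for w, h in zip(t["weight_kg"], t["height_m"])],
}

accepted = validate_candidates(table, labels, candidates)
print(accepted)
```

On this toy table only the hypothetical `bmi` candidate is kept, because it separates the classes better than either raw column, while the arbitrary sum adds nothing over weight alone. A real pipeline would score candidates with the actual downstream classifier rather than a median-split information gain.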
