Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents

Henrik Bradland; Morten Goodwin; Vladimir I. Zadorozhny; Per-Arne Andersen

TL;DR

Rogue One introduces a knowledge-informed AutoFE framework built on three collaborative LLM agents (Scientist, Extractor, Tester) that iteratively generate and validate features. It combines a flooding-pruning strategy with a retrieval-augmented generation (RAG) knowledge base to integrate external domain knowledge, yielding semantically meaningful and interpretable features. Empirically, Rogue One outperforms state-of-the-art AutoFE methods on 19 classification and 9 regression datasets, achieving high accuracy and robust performance, while also surfacing novel hypotheses such as a potential biomarker in a myocardial dataset. This work demonstrates the value of multi-agent collaboration and rich qualitative feedback for feature discovery, offering a scalable and interpretable approach to knowledge-informed AutoFE with potential implications for scientific discovery in medicine, finance, and engineering.

Abstract

The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents (Scientist, Extractor, and Tester) that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented generation (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.
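The abstract names a "flooding-pruning" strategy without giving its details here, so the following is only a minimal sketch of the general idea of flooding a feature pool with candidates and then pruning it back. Everything in it is an assumption made for illustration: the function names (`flood`, `prune`, `flood_prune`), the use of pairwise products as a deterministic stand-in for LLM-proposed features, and the use of absolute Pearson correlation as a stand-in for the Tester agent's feedback. It is not the Rogue One implementation.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def flood(features):
    """Hypothetical flooding step: propose many candidates.

    Assumption: pairwise products stand in for features an LLM agent would propose.
    """
    candidates = {}
    for (na, a), (nb, b) in itertools.combinations(sorted(features.items()), 2):
        candidates[f"({na}*{nb})"] = [x * y for x, y in zip(a, b)]
    return candidates


def prune(features, target, keep):
    """Hypothetical pruning step: keep the `keep` features most correlated with the target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return {name: features[name] for name in ranked[:keep]}


def flood_prune(base, target, rounds=2, keep=4):
    """Alternate flooding (exploration) and pruning (exploitation) for a few rounds."""
    pool = dict(base)
    for _ in range(rounds):
        pool.update(flood(pool))
        pool = prune(pool, target, keep)
    return pool


# Toy data: the target is exactly the product of columns a and b.
rng = random.Random(0)
a = [rng.uniform(-1, 1) for _ in range(200)]
b = [rng.uniform(-1, 1) for _ in range(200)]
c = [rng.uniform(-1, 1) for _ in range(200)]
target = [x * y for x, y in zip(a, b)]

selected = flood_prune({"a": a, "b": b, "c": c}, target)
print(sorted(selected))
```

On this toy data the interaction feature `(a*b)` survives pruning because it matches the target exactly. In a knowledge-informed setting, the candidate generator and the scoring signal would presumably be far richer than these stand-ins.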
