AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Yu Li; Lehui Li; Qingmin Liao; Fengli Xu; Yong Li

TL;DR

AgentExpt addresses the challenge of recommending baselines and datasets for a research idea by formulating the task as retrieving $R_B(q) \subseteq \mathcal{B}$ and $R_D(q) \subseteq \mathcal{D}$ given a query $q$, where $\mathcal{B}$ and $\mathcal{D}$ denote the candidate baselines and datasets. It introduces a two-stage framework: a collective perception-augmented retriever that leverages both self-descriptions and citation contexts, and a reasoning-augmented reranker that uses interaction chains to produce interpretable justifications and refined rankings. The authors build the AgentExpt dataset from $108{,}825$ papers, $116{,}970$ baselines, and $68{,}316$ datasets across ten AI venues, and report average gains of +5.85\% in Recall@20 and +8.30\% in HitRate@5 over the strongest prior baseline. This work advances reproducible and interpretable automation of experimental design in AI research.
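As a rough illustration of the two reported metrics, the sketch below computes Recall@K and HitRate@K for a single query using their common textbook definitions. The paper's exact definitions are not given here, so treat these as assumptions; the dataset names and the ranking are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


# Hypothetical ranking R_D(q) for one query q, and the datasets the paper actually used.
ranked_datasets = ["ImageNet", "CIFAR-10", "MNIST", "COCO", "SVHN"]
used_datasets = {"CIFAR-10", "COCO", "Places365"}

recall_5 = recall_at_k(ranked_datasets, used_datasets, k=5)
hit_1 = hit_rate_at_k(ranked_datasets, used_datasets, k=1)
hit_2 = hit_rate_at_k(ranked_datasets, used_datasets, k=2)
```

In an evaluation, these per-query values would typically be averaged over all test queries.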

Abstract

Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents to facilitate scientific discovery. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from two limitations. First, data coverage is limited: recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers. Second, an overreliance on content similarity biases models toward superficial similarity and overlooks experimental suitability. Harnessing the collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception-enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85\% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85\% in Recall@20 and +8.30\% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
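To make the retriever's representation step concrete, here is a minimal sketch of concatenating a resource's self-description with its citation contexts and then ranking candidates against a query. It is not the authors' implementation: the bag-of-words cosine similarity is a stand-in assumption for the finetuned embedding model, and the function names, the `max_contexts` cap, and the example data are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def build_representation(self_description: str,
                         citation_contexts: List[str],
                         max_contexts: int = 3) -> str:
    """Concatenate a self-description with a capped number of citation contexts."""
    return " ".join([self_description] + citation_contexts[:max_contexts])


def bow(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, representations: Dict[str, str], k: int) -> List[str]:
    """Return the k candidate names most similar to the query."""
    q = bow(query)
    ranked = sorted(representations,
                    key=lambda name: cosine(q, bow(representations[name])),
                    reverse=True)
    return ranked[:k]


# Hypothetical candidates: the self-description alone shares no words with the
# query, but the citation context records how other papers used the resource.
representations = {
    "dataset_a": build_representation(
        "sensor readings collected from highway loop detectors",
        ["we evaluate our traffic forecasting model on this benchmark"]),
    "dataset_b": build_representation(
        "a corpus of movie reviews labeled by sentiment",
        ["used for sentiment classification"]),
}
top = retrieve("traffic forecasting using graph neural networks", representations, k=1)
```

In this toy example `dataset_a` is ranked first only because of the words contributed by its citation context, which illustrates why usage evidence from citing papers can matter beyond content similarity to the self-description.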
