AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Yu Li, Lehui Li, Qingmin Liao, Fengli Xu, Yong Li

TL;DR

AgentExpt addresses the challenge of recommending baselines and datasets for a research idea by formulating the task as retrieving $R_B(q) \subseteq \mathcal{B}$ and $R_D(q) \subseteq \mathcal{D}$ given a query $q$. It introduces a two-stage framework: a collective perception-augmented retriever that leverages both self-descriptions and citation contexts, and a reasoning-augmented reranker that uses interaction chains to produce interpretable justifications and refined rankings. The authors build AgentExpt from $108{,}825$ papers, $116{,}970$ baselines, and $68{,}316$ datasets across ten AI venues, and report average gains of +5.85% in Recall@20 and +8.30% in HitRate@5 over the strongest prior baseline. This work advances reproducible and interpretable automation of experimental design in AI research.

Abstract

Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents that facilitate scientific inquiry. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from two problems. The first is limited data coverage: recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers. The second is an overreliance on content similarity, which biases models toward superficial similarity and overlooks experimental suitability. Harnessing the collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception-enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
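
The abstract reports results in Recall@20 and HitRate@5 but does not spell out how they are computed. The following is a minimal sketch of the conventional definitions of these two metrics, not the authors' evaluation code. The function names and the toy baseline names are assumptions made for illustration.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear among the top-k ranked items."""
    if not relevant:
        return 0.0  # convention chosen here: no ground truth means zero recall
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item appears among the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


# Toy example (hypothetical data): a ranked list of recommended baselines
# and the baselines a target paper actually used.
ranked = ["BERT", "GPT-2", "T5", "RoBERTa"]
relevant = {"T5", "XLNet"}

print(recall_at_k(ranked, relevant, k=3))    # one of two relevant items retrieved
print(hit_rate_at_k(ranked, relevant, k=2))  # no relevant item in the top 2
print(hit_rate_at_k(ranked, relevant, k=3))  # "T5" appears in the top 3
```

Averaging each metric over all query papers gives a corpus-level score, which is presumably what the reported average gains compare.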

Paper Structure

This paper contains 37 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of the Research Problem
  • Figure 2: Constructing chain-derived candidates and analysis. Left: From the interaction graph of papers, baselines and datasets, we extract interaction chains and aggregate the terminal items to form a chain-derived dataset/baseline pool. Right: (i) Recall (%) between a target paper's actual baselines/datasets and candidates from each setting, and (ii) Precision (%) of overlapped items within the corresponding candidate pool. We evaluate three settings on both the baseline side and the dataset side: chain-derived top-100, same conference-derived top-100, and embedding top-100. Chain-derived candidates recover on average 60.14% of baselines and 78.61% of datasets while occupying 2.52% and 5.63% of the respective chain-derived pools, indicating that interaction chains provide a compact yet highly informative prior for selecting baselines and datasets.
  • Figure 3: Pipeline for constructing the AgentExpt knowledge base. We (1) download and parse papers (from flagship AI conferences), (2) identify baselines and datasets by locating experiment sections, (3) apply rule-based and LLM-based filtering using citation frequency, naming consistency, and contextual positioning to prune false positives. The final dataset contains 108,825 papers, 116,970 baseline entities, 68,316 dataset entities, and their respective cross‑entity connections.
  • Figure 4: The coverage of experiment baselines and datasets. The vertical axis indicates the coverage, calculated as the fraction of resources employed in year N that were introduced in preceding years (1 to N-1), reflecting the dependency on established experimental components over time.
  • Figure 5: Illustration of Collective Perception Augmented Retrieval
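
The coverage measure in the Figure 4 caption (the fraction of resources employed in year N that were introduced in years 1 to N-1) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code. It assumes that a resource counts as "introduced" in the first year it appears in the usage records, and the function name and sample data are hypothetical.

```python
from typing import Dict, Set


def coverage_by_year(used: Dict[int, Set[str]]) -> Dict[int, float]:
    """For each year N, return the share of resources used in year N
    that were already seen in some earlier year.

    Assumption: "introduced" means first observed in the usage records.
    """
    seen: Set[str] = set()          # resources observed in preceding years
    coverage: Dict[int, float] = {}
    for year in sorted(used):
        resources = used[year]
        if resources:
            coverage[year] = len(resources & seen) / len(resources)
        seen |= resources           # this year's resources count for later years
    return coverage


# Hypothetical usage records: year -> datasets used by papers of that year.
used = {
    2021: {"CIFAR-10", "ImageNet"},
    2022: {"ImageNet", "COCO"},
    2023: {"ImageNet", "COCO", "CIFAR-10", "LAION"},
}
print(coverage_by_year(used))
```

Under this reading the earliest year always has coverage 0, since nothing precedes it, so only later years are informative.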