AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

Yunxiao Shi; Wujiang Xu; Tingwei Chen; Haoning Shang; Ling Yang; Yunfeng Wan; Zhuo Cao; Xing Zi; Dimitris N. Metaxas; Min Xu

TL;DR

AgentSelect is a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data, providing the first unified data and evaluation infrastructure for agent recommendation.

Abstract

LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that the synthesized compositional interactions in Part III are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, establishing a reproducible foundation to study and accelerate the emerging agent ecosystem.
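To make the contrast between dense head reuse and near one-off supervision concrete, the following is a minimal illustrative sketch, not the benchmark's actual schema or tooling. The `Interaction` record, the `head_share` helper, and all IDs are hypothetical names introduced here; the only assumption taken from the abstract is that the data is positive-only, so each record simply pairs a query with an agent deemed suitable for it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical positive-only record: the agent is a good match for the query."""
    query_id: str
    agent_id: str


def head_share(interactions, top_fraction=0.2):
    """Share of all interactions captured by the most popular `top_fraction` of agents."""
    counts = Counter(record.agent_id for record in interactions)
    ranked = sorted(counts.values(), reverse=True)
    # Always include at least one agent in the "head".
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Dense head reuse: one agent is the positive for most queries.
dense = [Interaction(f"q{i}", "agent_a") for i in range(8)]
dense += [Interaction("q8", "agent_b"), Interaction("q9", "agent_c")]

# Near one-off supervision: every query has its own, never-reused agent.
one_off = [Interaction(f"q{i}", f"agent_{i}") for i in range(10)]

print(head_share(dense))    # the head agent covers most interactions
print(head_share(one_off))  # the head covers only its proportional share
```

In the one-off regime the head share collapses toward `top_fraction` itself, which gives some intuition for why the abstract describes purely popularity-driven signals as fragile there and argues for matching on capability content instead.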

Paper Structure (26 sections, 1 equation, 6 figures, 7 tables)

Introduction
Related Work
LLM Evaluation
Information Retrieval in Agent Ecosystem
Task Definition: Narrative Query-to-Agent Recommendation
AgentSelect Benchmark
Capability Profile Design
Dataset Design
Benchmark Characteristics
Experimental Setup
Results and Analysis
Modality Attribution: IDs vs. Text
Effectiveness Validation for Pseudo-Positive Interactions
Practical Real-World Validation
Validation on MuleRun Agent MarketPlace
...and 11 more sections

Figures (6)

Figure 1: Overview of AgentSelect. We construct three benchmark parts—LLM-only (Part I), toolkit-only (Part II), and compositional agents (Part III)—and use the resulting interactions to train an agent recommender for natural-language queries. Arrows show the flow; icons indicate backbone LLMs, tools, and composed agents.
Figure 2: Compositional Agent Construction and Query-Agent Interaction Simulation Pipeline for Part III.
Figure 3: Overview of Benchmark Characteristics and Distribution.
Figure 4: Long-tailed agent popularity in our benchmark. Left: Pareto curve over all parts. Right: popularity-by-rank curves, with Part I/II/III overlaid. We report two definitions: Unnormalized popularity counts reuse by agent IDs, while Normalized popularity is derived from content components (LLM and tools) and scaled to $[0,1]$, mitigating near-unique agent IDs under one-off supervision.
Figure 5: Online deployment of the recommended agent configuration, including the final runtime policy $C_{\text{finalize}}$ and glue code for Agno integration.
...and 1 more figure
