AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

Yunxiao Shi, Wujiang Xu, Tingwei Chen, Haoning Shang, Ling Yang, Yunfeng Wan, Zhuo Cao, Xing Zi, Dimitris N. Metaxas, Min Xu

TL;DR

AgentSelect is a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data, providing the first unified data and evaluation infrastructure for agent recommendation.

Abstract

LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that Part III synthesized compositional interactions are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, which establishes a reproducible foundation to study and accelerate the emerging agent ecosystem.

Paper Structure (26 sections, 1 equation, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Overview of AgentSelect. We construct three benchmark parts—LLM-only (Part I), toolkit-only (Part II), and compositional agents (Part III)—and use the resulting interactions to train an agent recommender for natural-language queries. Arrows show the flow; icons indicate backbone LLMs, tools, and composed agents.
  • Figure 2: Compositional Agent Construction and Query-Agent Interaction Simulation Pipeline for Part III.
  • Figure 3: Overview of Benchmark Characteristics and Distribution.
  • Figure 4: Long-tailed agent popularity in our benchmark. Left: Pareto curve over all parts. Right: popularity-by-rank curves, with Part I/II/III overlaid. We report two definitions: Unnormalized popularity counts reuse by agent IDs, while Normalized popularity is derived from content components (LLM and tools) and scaled to $[0,1]$, mitigating near-unique agent IDs under one-off supervision.
  • Figure 5: Online deployment of the recommended agent configuration, including the final runtime policy $C_{\text{finalize}}$ and glue code for Agno integration.
  • ...and 1 more figure
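
The two popularity definitions in the Figure 4 caption can be illustrated with a minimal Python sketch. The caption only states that normalized popularity is derived from content components (LLM and tools) and scaled to $[0,1]$; the specific choices below (averaging component frequencies, then min-max scaling) are assumptions made for illustration and are not confirmed as the paper's formula. The record layout, the function names, and the toy data are likewise hypothetical.

```python
from collections import Counter

# Hypothetical toy interaction records: (query_id, agent_id, backbone_llm, tools).
interactions = [
    ("q1", "a1", "llm-A", ("search", "calc")),
    ("q2", "a2", "llm-A", ("search",)),
    ("q3", "a3", "llm-B", ("calc",)),
    ("q4", "a1", "llm-A", ("search", "calc")),
]


def unnormalized_popularity(records):
    """Count how often each agent ID appears in the interaction records."""
    return Counter(agent_id for _, agent_id, _, _ in records)


def normalized_popularity(records):
    """Score each agent by the mean frequency of its components (assumed rule),
    then min-max scale the scores to [0, 1]."""
    component_counts = Counter()
    agent_components = {}
    for _, agent_id, llm, tools in records:
        components = (f"llm:{llm}",) + tuple(f"tool:{t}" for t in tools)
        component_counts.update(components)
        agent_components[agent_id] = components

    if not agent_components:
        return {}

    # Raw score: average number of interactions in which each component occurs.
    raw = {
        agent_id: sum(component_counts[c] for c in comps) / len(comps)
        for agent_id, comps in agent_components.items()
    }
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    # If all agents share the same raw score, scaling is undefined; return 0.0.
    return {a: (s - lo) / span if span else 0.0 for a, s in raw.items()}


if __name__ == "__main__":
    print(unnormalized_popularity(interactions))
    print(normalized_popularity(interactions))
```

In this toy data, agent `a2` appears only once by ID, yet it is built from the most frequently used LLM and tool, so its component-based score is as high as that of the reused agent `a1`. This is the kind of effect the caption describes: component-level popularity remains informative even when agent IDs are near-unique.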