The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He

TL;DR

Toolathlon presents a large-scale, real-world benchmark for language agents that spans 32 applications and 604 tools across 108 tasks, each starting from realistic initial states and evaluated through deterministic, execution-based scripts. It combines remote and containerized environments with a robust agent framework to enable safe, parallel evaluation, and reveals that current SOTA models struggle with long-horizon, cross-app coordination, achieving only modest success rates. The work highlights key challenges in long-context handling and tool-calling robustness, while also providing detailed analyses of errors and cost-performance trade-offs. By open-sourcing the benchmark and framework, Toolathlon aims to accelerate progress toward practical, high-reliability language agents capable of real-world workflows.

Abstract

Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, each requiring interaction with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool-calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

Paper Structure

This paper contains 36 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Two examples and the initial environment states in Toolathlon. We showcase real-world environment interaction and realistic state initialization here.
  • Figure 2: Overview of the Toolathlon evaluation framework.
  • Figure 3: Example task instructions from our benchmark (left) and MCPMark [mcpmark2025] (right). Ours contain fuzzier intent that the model needs to infer from the environment states.
  • Figure 4: Task topic distribution of Toolathlon.
  • Figure 5: Presence ratios of two kinds of tool-calling errors for different models.
  • ...and 6 more figures