MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow

TL;DR

MCP-Bench addresses the need for realistic, multi-step tool use by connecting LLM agents to a diverse MCP server ecosystem (28 servers, 250 tools) and automatically generating 104 complex tasks with fuzzy instructions. It formalizes agent operation as a POMDP with multi-round planning and inter-server coordination, and evaluates agents with a hybrid rule-based and rubric-based LLM judge, including prompt-shuffling to boost robustness. Experiments on 20 LLMs reveal that while schema understanding is near-universal, high-quality long-horizon planning and cross-domain orchestration remain challenging, particularly for smaller models. The benchmark thus provides a scalable, ecosystem-aware platform to drive progress in agentic reasoning, tool coordination, and grounding in real-world tool networks.

Abstract

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows. Existing benchmarks, which rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations, do not adequately evaluate these capabilities. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.

Paper Structure

This paper contains 25 sections, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: MCP-Bench connects LLM agents to real-world MCP servers exposing 250 structured tools across domains such as finance, science, and research. Tasks are generated via LLM-based synthesis, then executed by the agent through multi-turn tool invocations. Each execution trajectory is evaluated using a combination of rule-based checks and LLM-as-a-Judge scoring, assessing agent performance in tool schema understanding, multi-hop planning, and real-world adaptability.
  • Figure 2: Category distribution of MCP servers.
  • Figure 3: Tool distribution across servers.
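
The hybrid evaluation described in the TL;DR and the Figure 1 caption (rule-based checks plus a rubric-based LLM judge with prompt shuffling) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tool registry, the metric names `valid_tool_rate` and `schema_compliance`, the `judge` callable, and the reading of "prompt shuffling" as reordering rubric criteria across rounds and averaging the scores are all assumptions made for illustration.

```python
import random
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical tool registry: tool name -> required parameter names.
# Real MCP servers publish full JSON schemas; this is a simplification.
TOOLS: Dict[str, List[str]] = {
    "get_stock_price": ["ticker"],
    "search_papers": ["query", "limit"],
}


def rule_based_scores(calls: List[dict]) -> Dict[str, float]:
    """Rule-based checks over one trajectory of tool calls.

    valid_tool_rate: fraction of calls that name a known tool.
    schema_compliance: fraction of calls that name a known tool
    and supply every required parameter for it.
    """
    if not calls:
        return {"valid_tool_rate": 0.0, "schema_compliance": 0.0}
    valid = [c for c in calls if c["tool"] in TOOLS]
    compliant = [
        c for c in valid if all(p in c["args"] for p in TOOLS[c["tool"]])
    ]
    return {
        "valid_tool_rate": len(valid) / len(calls),
        "schema_compliance": len(compliant) / len(calls),
    }


def shuffled_judge_score(
    rubric: List[str],
    judge: Callable[[List[str]], Dict[str, float]],
    rounds: int = 5,
    seed: int = 0,
) -> float:
    """Average a judge's rubric scores over several shuffled orderings.

    `judge` stands in for an LLM call: it receives the rubric criteria in
    the order they would appear in the prompt and returns a score in
    [0, 1] per criterion. Shuffling is meant to reduce order sensitivity.
    """
    rng = random.Random(seed)
    per_round = []
    for _ in range(rounds):
        order = rubric[:]
        rng.shuffle(order)
        scores = judge(order)
        per_round.append(mean(scores[c] for c in rubric))
    return mean(per_round)


def stub_judge(criteria: List[str]) -> Dict[str, float]:
    """Placeholder judge that returns a constant score per criterion."""
    return {c: 0.5 for c in criteria}


example_calls = [
    {"tool": "get_stock_price", "args": {"ticker": "ACN"}},
    {"tool": "search_papers", "args": {"query": "mcp"}},  # missing "limit"
    {"tool": "unknown_tool", "args": {}},                 # not in registry
]
rule_scores = rule_based_scores(example_calls)
judge_score = shuffled_judge_score(
    ["task completion", "planning", "grounding"], stub_judge
)
```

In practice the stub would be replaced by a real LLM call, and the rule-based and judge scores would be reported side by side or combined with a weighting the sketch does not attempt to guess.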