MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong

TL;DR

MCPEval is an automated, MCP-based framework for end-to-end evaluation of AI agents, addressing the need for scalable, protocol-grounded assessment beyond static benchmarks. It combines automated task generation, automated ground-truth verification, and two evaluation channels (tool-call matching and LLM-judge rubrics) to produce granular, domain-aware insights across five real-world domains. Across model families, the study finds that models consistently score higher on execution (trajectory) quality than on task completion, and it highlights domain-specific gaps: OpenAI models typically lead in tool use and reasoning quality, while open-source models are competitive on certain tasks. By releasing MCPEval as open source, the work aims to enable reproducible, scalable, and standardized evaluation for advancing robust, tool-enabled AI agents.
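To make the tool-call matching channel concrete, here is a minimal Python sketch of how an agent's tool calls might be scored against verified ground-truth calls. It is an illustration under stated assumptions, not MCPEval's actual implementation: the `ToolCall` class, the `match_tool_calls` function, the strict position-by-position comparison, and the two score names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation: a tool name plus its arguments (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def match_tool_calls(predicted: list[ToolCall], expected: list[ToolCall]) -> dict[str, float]:
    """Compare predicted and ground-truth calls position by position.

    Assumption: order matters and arguments must match exactly. A real
    evaluator may allow reordering or partial credit on arguments.
    """
    if not predicted and not expected:
        return {"name_match": 1.0, "exact_match": 1.0}

    name_hits = 0
    exact_hits = 0
    for pred, gold in zip(predicted, expected):
        if pred.name == gold.name:
            name_hits += 1
            if pred.arguments == gold.arguments:
                exact_hits += 1

    # Dividing by the longer list penalizes both missing and extra calls.
    denom = max(len(predicted), len(expected))
    return {"name_match": name_hits / denom, "exact_match": exact_hits / denom}


# Illustrative data only; the tool names below are made up.
expected = [
    ToolCall("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ToolCall("book_flight", {"flight_id": "UA100"}),
]
predicted = [
    ToolCall("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ToolCall("book_flight", {"flight_id": "UA200"}),
]
scores = match_tool_calls(predicted, expected)
print(scores)
```

In this example both tool names line up, but the second call carries a different argument, so the name-level score is 1.0 and the exact-match score is 0.5. Separating the two scores is one way to tell "picked the wrong tool" apart from "picked the right tool with wrong parameters".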

Abstract

The rapid rise of Large Language Model (LLM)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

Paper Structure

This paper contains 74 sections, 21 figures, and 10 tables.

Figures (21)

  • Figure 1: User interface of the MCPEval framework. The dashboard provides streamlined access to core functionalities such as automatic task generation, verification, model evaluation, and result analysis. It integrates real-time activity tracking and system overviews to ensure transparency and ease of use.
  • Figure 2: Two-step MCP-based task generation workflow, comprising an initial generation phase and a verification phase.
  • Figure 3: MCPEval evaluation workflow, showing MCP client/server interaction, tool-call correctness checking, LLM judger assessment, and automated report generation.
  • Figure 4: Domain performance analysis: (a) Domain ranking by LLM judger, (b) Trajectory vs completion comparison, (c) Task distribution, (d) Performance gaps by domain.
  • Figure 5: Performance gap analysis: (a) Overall gap distribution, (b) Model-wise gaps, (c) Domain-wise gaps, (d) Gap-performance correlation.
  • ...and 16 more figures