MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong
TL;DR
MCPEval introduces an automated, MCP-based framework for end-to-end evaluation of AI agents, addressing the need for scalable, protocol-grounded assessment beyond static benchmarks. It combines automated task generation, automated ground-truth verification, and two evaluation channels (tool-call matching and LLM judge rubrics) to produce granular, domain-aware insights across five real-world domains. The study finds that, across model families, execution scores are consistently higher than completion scores, and it highlights domain-specific gaps: OpenAI models typically lead in tool use and reasoning quality, while open-source models are competitive on certain tasks. By releasing MCPEval as open source, the work aims to enable reproducible, scalable, and standardized evaluation for advancing robust, tool-enabled AI agents.
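As a rough illustration of what a tool-call matching channel can look like, the Python sketch below scores an agent's tool calls against ground-truth calls. The names ToolCall, argument_overlap, and match_tool_calls, the strict position-by-position comparison, and the two reported scores are all assumptions made for illustration; they are not MCPEval's actual API or metric definitions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool invocation (name plus arguments)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def argument_overlap(expected: dict[str, Any], actual: dict[str, Any]) -> float:
    """Fraction of expected arguments that the agent reproduced exactly."""
    if not expected:
        # No arguments expected: full credit only if none were supplied.
        return 1.0 if not actual else 0.0
    hits = sum(1 for key, value in expected.items()
               if key in actual and actual[key] == value)
    return hits / len(expected)


def match_tool_calls(expected: list[ToolCall],
                     actual: list[ToolCall]) -> dict[str, float]:
    """Compare calls position by position (assumed strict ordering).

    Missing or extra calls lower both scores because the denominator is
    the longer of the two sequences.
    """
    total = max(len(expected), len(actual))
    if total == 0:
        return {"name_match": 1.0, "argument_match": 1.0}
    name_hits = 0
    argument_score = 0.0
    for exp, act in zip(expected, actual):
        if exp.name == act.name:
            name_hits += 1
            argument_score += argument_overlap(exp.arguments, act.arguments)
    return {
        "name_match": name_hits / total,
        "argument_match": argument_score / total,
    }
```

A strict, order-sensitive comparison is only one possible design; a more lenient variant could match calls regardless of order or give partial credit for near-miss argument values. The second channel, LLM judge rubrics, is not covered by this sketch.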
Abstract
The rapid rise of Large Language Model (LLM)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.
