MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, Mao Qin, Yinxiao Chen, Chen Peng, Shangguang Wang, Mengwei Xu

TL;DR

MCPWorld addresses the lack of standardized benchmarks for API-enabled CUAs by offering a unified, white-box benchmarking testbed across API, GUI, and hybrid agents. It combines a containerized desktop environment, dynamic instrumentation, targeted code injection, and MCP-driven state querying to verify task progress from within the application under test. The benchmark suite includes 201 tasks across 10 open-source apps, and empirical results show that Hybrid configurations outperform GUI-Only and MCP-Only baselines, highlighting the value of combining API access with GUI automation. This framework provides a robust, tool-agnostic platform to drive development and fair comparison of next-generation computer-use agents.

Abstract

(M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore function interactions exposed by application APIs, e.g., Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., apps whose source code is available and can be revised and recompiled as needed (e.g., to add MCP support). This has two notable advantages: (1) It greatly broadens the design space of CUA, such as which app features are exposed or extracted as CUA-callable APIs, and how. (2) It allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation decoupled from specific agent implementations or UI states. Currently, MCPWorld includes 201 well-curated and annotated user tasks, covering diversified use cases and difficulty levels. MCPWorld is also fully containerized with GPU acceleration support for flexible adoption on different OS/hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy and provide initial evidence of the practical effectiveness of agent automation leveraging MCP. Overall, we expect MCPWorld to facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at https://github.com/SAAgent/MCPWorld.

Paper Structure

This paper contains 39 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: The MCPWorld evaluation workflow. The Task Manager initializes the environment by loading task configurations and application data snapshots. It then starts the application within a Docker container. The Agent interacts with the application through a Unified Tool-based Space, receiving observations and sending actions (GUI or MCP). The Evaluator monitors internal application signals triggered by these interactions and, based on the registered handler, reports task success or failure.
  • Figure 2: Comparison of evaluation paradigms for a dynamic task: debugging in an IDE. (a) External UI Matching: This approach struggles with timing as the exact moment a breakpoint hits is unpredictable, making it difficult to capture the relevant UI state (e.g., an A11Y tree) for verification. (b) Output File Matching: This method relies on explicit final outputs (e.g., log files, saved documents) to verify task completion. However, crucial in-memory states during debugging (like call stacks or variable values at a breakpoint) are often not persisted, rendering this approach insufficient for verifying such intermediate steps. (c) App Internal Hooking: Our approach directly instruments the application to "Catch Event" at the exact moment of interest (e.g., a breakpoint hit). It can then extract "Exact Moment Info" such as call stacks and variable states directly from memory, providing robust and precise verification of both intermediate key steps and final outcomes, independent of UI transience or the need for explicit file outputs.
  • Figure 3: Distribution of tasks in MCPWorld across ten open-source applications.
  • Figure 4: Comparison of success rate and key step completion rate across different modalities and bash settings.
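
The hook-based evaluation described in Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not MCPWorld's actual implementation: the `Evaluator` class, the event names (`breakpoint_set`, `breakpoint_hit`), the payload fields, and the key-step rate formula (reached key steps divided by total key steps) are all hypothetical, chosen only to show how a registered handler might turn internal application signals into a success verdict without inspecting the UI.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Evaluator:
    """Toy evaluator fed by in-app signals (hypothetical, for illustration)."""
    key_steps: List[str]                  # event names counted as key steps
    final_event: str                      # event whose payload decides success
    final_check: Callable[[Dict], bool]   # registered handler for that payload
    reached: Set[str] = field(default_factory=set)
    success: bool = False

    def on_event(self, name: str, payload: Dict) -> None:
        """Invoked by an assumed instrumentation hook when the app emits a signal."""
        if name in self.key_steps:
            self.reached.add(name)
        if name == self.final_event and self.final_check(payload):
            self.success = True

    def key_step_rate(self) -> float:
        """Fraction of key steps observed so far (assumed definition)."""
        if not self.key_steps:
            return 0.0
        return len(self.reached) / len(self.key_steps)


# Simulated debugging task: succeed if variable x equals 3 when the breakpoint hits.
evaluator = Evaluator(
    key_steps=["breakpoint_set", "breakpoint_hit"],
    final_event="breakpoint_hit",
    final_check=lambda payload: payload.get("variables", {}).get("x") == 3,
)

# In a real setup these calls would come from code injected into the app;
# here they are issued by hand to stand in for those signals.
evaluator.on_event("breakpoint_set", {"line": 12})
evaluator.on_event(
    "breakpoint_hit",
    {"call_stack": ["main", "compute"], "variables": {"x": 3}},
)

print(evaluator.success, evaluator.key_step_rate())  # prints: True 1.0
```

Because the verdict depends only on the event payload captured at the moment of interest, the same check applies whether the agent reached that state through GUI actions, MCP calls, or a mix of both.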