MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

Yunhe Yan; Shihe Wang; Jiajun Du; Yexuan Yang; Yuxuan Shan; Qichen Qiu; Xianqing Jia; Xinge Wang; Xin Yuan; Xu Han; Mao Qin; Yinxiao Chen; Chen Peng; Shangguang Wang; Mengwei Xu

TL;DR

MCPWorld addresses the lack of standardized benchmarks for API-enabled computer use agents (CUAs) by offering a unified, white-box benchmarking testbed for API, GUI, and hybrid agents. It combines a containerized desktop environment, dynamic instrumentation, targeted code injection, and MCP-driven state querying to verify task progress from within the application under test. The benchmark suite includes 201 tasks across 10 open-source apps, and empirical results show that Hybrid configurations outperform GUI-Only and MCP-Only baselines, highlighting the value of combining API access with GUI automation. The framework provides a tool-agnostic platform for developing and fairly comparing next-generation computer use agents.

Abstract

(M)LLM-powered computer use agents (CUAs) are emerging as a transformative technique for automating human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents; their evaluation methods are susceptible to UI changes and ignore the functional interactions exposed by application APIs, e.g., through the Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., apps whose source code is available and can be revised and re-compiled as needed (e.g., to add MCP support). This brings two notable advantages: (1) It greatly broadens the design space of CUAs, such as which app features are exposed or extracted as CUA-callable APIs, and how. (2) It allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation that is decoupled from specific agent implementations and UI states. Currently, MCPWorld includes 201 well-curated and annotated user tasks, covering diverse use cases and difficulty levels. MCPWorld is also fully containerized, with GPU acceleration support, for flexible adoption across different OS and hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy, providing initial evidence of the practical effectiveness of agent automation that leverages MCP. Overall, we anticipate that MCPWorld will facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at https://github.com/SAAgent/MCPWorld.
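To illustrate the idea of verifying task completion from inside the application rather than from the UI, the following is a minimal Python sketch. It is not MCPWorld's implementation: `TaskVerifier`, `instrument`, `ToyEditor`, and the event names are all hypothetical, and simple function wrapping stands in for the dynamic code instrumentation mentioned in the abstract. The point it shows is that the verdict depends only on which application functions actually ran, so a GUI-style and an API-style interaction are judged identically.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TaskVerifier:
    """Hypothetical evaluator: a task is complete once all required in-app events fired."""
    required_events: set[str]
    seen: set[str] = field(default_factory=set)

    def on_event(self, name: str) -> None:
        # Called from a hook inside the app, never by the agent itself.
        if name in self.required_events:
            self.seen.add(name)

    def completed(self) -> bool:
        return self.seen == self.required_events


def instrument(func: Callable[..., Any], event: str, verifier: TaskVerifier) -> Callable[..., Any]:
    """Wrap an app function so the verifier is notified after it returns without raising."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)
        verifier.on_event(event)
        return result
    return wrapper


class ToyEditor:
    """Stand-in for a white-box application (an assumption for this sketch)."""
    def __init__(self) -> None:
        self.text = ""
        self.saved = False

    def insert_text(self, s: str) -> None:
        self.text += s

    def save(self) -> None:
        self.saved = True


def run_task(via_api: bool) -> bool:
    """Run the toy task 'type hello and save' through an API-like or GUI-like path."""
    editor = ToyEditor()
    verifier = TaskVerifier({"text_inserted", "file_saved"})
    # Hook the app's own functions; the agent-facing surface is untouched.
    editor.insert_text = instrument(editor.insert_text, "text_inserted", verifier)
    editor.save = instrument(editor.save, "file_saved", verifier)

    if via_api:
        editor.insert_text("hello")      # one API-style call
    else:
        for ch in "hello":               # simulated per-keystroke GUI input
            editor.insert_text(ch)
    editor.save()
    return verifier.completed()
```

Calling `run_task(True)` and `run_task(False)` exercises the two interaction styles against the same hooks; because the check observes application behavior, it does not need to inspect screenshots or widget trees.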
