MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang

TL;DR

MMBench-GUI presents a cross-platform, hierarchical benchmark for GUI agents with four levels of increasing difficulty (content understanding, grounding, task automation, task collaboration) and a novel Efficiency-Quality Area (EQA) metric that jointly assesses success and efficiency. It reveals that precise visual grounding is the main bottleneck and that adding specialized grounding modules substantially improves performance, while long-horizon planning, memory, and cross-app coordination remain challenging. The benchmark spans Windows, macOS, Linux, Web, Android, and iOS (with macOS online tasks) and provides extensive task inventories, baselines, and analysis to guide future GUI-agent research. Together, these contributions offer a rigorous, practical framework to drive robust, scalable GUI automation across platforms.

Abstract

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing critical roles. More importantly, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, taking excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.

Paper Structure

This paper contains 16 sections, 14 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: MMBench-GUI: a hierarchical benchmark spanning four levels of increasing difficulty, covering over 8,000 tasks across six commonly used platforms. From L1 to L4, task complexity increases progressively, placing growing demands on the agent’s generalization and reasoning abilities. Based on this benchmark, we visualize the performance of various models in the right figure, clearly illustrating their respective strengths as well as areas with substantial room for improvement.
  • Figure 2: Examples for L1&L2. Both of them are offline tasks. We provide examples from different platforms for each level. For clarity, some less critical fields are not shown here and full examples are available for download in our public repository.
  • Figure 3: Examples for L3&L4. Tasks at these levels are evaluated online in a virtual environment. For L4, we show two images from different applications to illustrate that collaboration across applications is the core aspect of this level.
  • Figure 4: Left: Demonstrates the relative contribution of visual grounding versus planning in driving performance gains under current conditions. We consider two experimental conditions—fixing the planner while varying the grounder, and vice versa—and examine how different combinations affect task success rate. Similar color hues denote groups with the same fixed planner or grounder. Right: Task success grows roughly linearly with visual-grounding accuracy. General-purpose language models are virtually “blind” at the L2 grounding stage, which drives their L3 automation success rate (SR) sharply down. Plugging in a dedicated visual grounder restores precise perception and, in turn, lifts SR dramatically—highlighting fine-grained grounding as the principal bottleneck.
  • Figure 5: EQA visualization across different models under L3 for different allowed steps. As discussed in the section on L3, EQA reflects a combination of task completion and efficiency (i.e., the number of steps used upon completion). In practice, we compute it by interpolating both the step budget and the success rate (SR) 100 times. The area under the curve formed by these interpolated SR values yields the final EQA score.
  • ...and 1 more figure
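
The EQA computation described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `eqa_score`, its arguments, the normalization of the step budget to [0, 1], and the use of linear interpolation with the trapezoidal rule are all assumptions made for illustration. The caption only states that the step budget and SR are interpolated 100 times and that the area under the resulting curve is the score.

```python
import numpy as np


def eqa_score(steps_used, succeeded, max_steps, num_points=100):
    """Hypothetical EQA sketch: area under the SR-vs-step-budget curve.

    steps_used: steps each task took before the agent stopped.
    succeeded:  whether each task was ultimately completed.
    max_steps:  largest allowed step budget (must be > 0).
    """
    steps_used = np.asarray(steps_used, dtype=float)
    succeeded = np.asarray(succeeded, dtype=bool)

    # Success rate at every integer budget b: a task counts only if it
    # succeeded AND finished within b steps.
    budgets = np.arange(max_steps + 1)
    sr = np.array([(succeeded & (steps_used <= b)).mean() for b in budgets])

    # Assumption: normalize the budget axis to [0, 1] and resample the
    # curve at num_points evenly spaced positions by linear interpolation.
    grid = np.linspace(0.0, 1.0, num_points)
    sr_interp = np.interp(grid, budgets / max_steps, sr)

    # Trapezoidal area under the interpolated curve; lies in [0, 1].
    widths = np.diff(grid)
    heights = (sr_interp[1:] + sr_interp[:-1]) / 2.0
    return float(np.sum(heights * widths))


if __name__ == "__main__":
    # Two hypothetical agents with the same final success rate (2 of 3),
    # where the first one finishes its successful tasks in fewer steps.
    fast = eqa_score([3, 4, 15], [True, True, False], max_steps=15)
    slow = eqa_score([12, 14, 15], [True, True, False], max_steps=15)
    print(f"fast agent EQA: {fast:.3f}")
    print(f"slow agent EQA: {slow:.3f}")
```

Under these assumptions, two agents with identical final success rates receive different scores when one completes its tasks in fewer steps, which is the behavior the caption attributes to EQA.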