GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin

TL;DR

This work tackles the lack of a unified, end-to-end benchmark for autonomous GUI testing agents. It introduces GTArena, a formalized framework that decomposes GUI testing into test intention generation, test task execution, and GUI defect detection, supported by a novel GUI defect data structure and a POMDP-based decision model. The paper details three data sources (real-world defects, injected defects, and synthetic defects) and proposes a correlation-analysis method that relates general model capabilities to GUI-specific performance. Experimental results reveal notable gaps between current multimodal models and practical GUI testing needs, pointing to concrete directions for improving end-to-end capability and reproducibility. The code is available on GitHub.

Abstract

GUI agents are an active research topic in the AI community. However, current research focuses on GUI task automation, which limits its applicability across GUI scenarios. In this paper, we propose GTArena, a formalized and comprehensive environment for evaluating the entire process of automated GUI testing, offering a fair, standardized setting in which diverse multimodal large language models can be run consistently. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection, and construct a benchmark dataset based on these to conduct a comprehensive evaluation. The benchmark evaluates models on three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data. Additionally, we propose a method that helps researchers explore the correlation between the performance of multimodal large language models in specific scenarios and their general capabilities in standard benchmark tests. Experimental results indicate that even the most advanced models struggle to perform well across all subtasks of automated GUI testing, highlighting a significant gap between the current capabilities of autonomous GUI testing and its practical, real-world applicability. This gap provides guidance for the future direction of GUI agent development. Our code is available at https://github.com/ZJU-ACES-ISE/ChatUITest.
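
The abstract mentions relating a model's scores on standard benchmarks to its performance on GUI-specific tasks. The paper's actual procedure is not described in this summary, so the following is only a minimal sketch of one plausible way to do it, assuming a plain Pearson correlation across models. The function name `pearson` and the score lists are hypothetical placeholders, not values or APIs from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder numbers for illustration only (NOT results from the paper):
# one entry per hypothetical model, in the same order in both lists.
general_benchmark_scores = [1.0, 2.0, 3.0, 4.0]
gui_subtask_scores = [2.0, 4.0, 6.0, 8.0]

r = pearson(general_benchmark_scores, gui_subtask_scores)
print(f"Pearson r between general and GUI-specific scores: {r:.3f}")
```

A coefficient near 1 would suggest that the general capability tracks the GUI-specific one across the compared models, while a value near 0 would suggest little linear relationship. With only a handful of models, such a coefficient should be read as exploratory rather than conclusive.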

Paper Structure

This paper contains 21 sections, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: The Workflow for Autonomous GUI Testing (GTArena). GUI Testing requires the model to perform specific tasks, all of which are evaluated within this workflow. We provide a standardized and reproducible testing framework, enabling fair comparison of different multimodal large language models.
  • Figure 2: Source and Methodology for Benchmark Data Construction. The left side of the figure illustrates our primary data sources, which include intentionally injected defects within apps and synthetic defect data generated by post-processing action sequence data obtained from app executions. The right side of the figure shows supplemental data sources, specifically real-world applications with GUI defects.
  • Figure 3: Examples of Constructed Synthetic GUI Defects. We present examples of various constructed GUI defects, demonstrating the feasibility of synthesizing defects through post-processing. This approach highlights a method for building large-scale GUI defect datasets, including both display and interaction defects.
  • Figure 4: Example Defects in Artificially Injected Data.
  • Figure 5: Example episode from AitW_with_Defects.