GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration

Yuchen Sun, Shanhui Zhao, Tao Yu, Hao Wen, Samith Va, Mengwei Xu, Yuanchun Li, Chongyang Zhang

TL;DR

GUI-Xplore introduces a cross-app, cross-task GUI dataset built from per-app exploration videos and five hierarchical downstream tasks, addressing generalization gaps in GUI agents. The Xplore-Agent baseline combines Action-aware GUI Modeling with a GUI Transition Graph to enable exploration-guided reasoning, achieving around a 10% improvement in unfamiliar apps. The study demonstrates the value of exploration-then-reasoning for robust cross-domain GUI understanding, while also outlining practical limitations such as text-only outputs and data privacy concerns. Overall, the work provides a concrete dataset and baseline that push toward more versatile GUI agents capable of adapting to diverse software environments.

Abstract

GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.

Paper Structure

This paper contains 29 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison between the current GUI agent paradigm and our exploration-based paradigm. (a) The current paradigm learns only generalized GUI knowledge during the training stage and lacks the app-specific knowledge needed for inference in unfamiliar apps. For example, experience with PayPal cannot translate into guidance for operating WeChat. (b) Our exploration-based paradigm provides exploration videos for each app, offering rich information about the entire app, which enables the model to learn both generalized GUI knowledge and exploration-guided learning ability. In this example, equipping the GUI agent with knowledge from the exploration video lets it not only identify the proper operation sequence for a given task but also provide additional information drawn from the exploration.
  • Figure 2: Sample data from five downstream tasks. GUI-Xplore provides app exploration videos paired with five downstream tasks. The videos comprehensively capture all page and action information during the exploration phase. The downstream tasks use multiple-choice question answering and target different granularities of page and action information. Detailed samples are shown in the appendix.
  • Figure 3: An overview of the Xplore-Agent pipeline. The model takes an exploration video and a task query as inputs and generates a predicted answer. Specifically, the exploration video is converted into a textual exploration sequence through Action-aware Keyframe Extraction, View Hierarchy Generation, and Action Generation. The GUI Clustering Model then groups screens with similar functionalities, transforming the linear sequence into a GUI Transition Graph. Finally, the nodes and edges are used to compose the prompt for querying the LLM.
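
The last two stages described in Figure 3 (collapsing a linear exploration sequence into a GUI Transition Graph, then composing a prompt from its nodes and edges) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(screen, action, next_screen)` step format, the cluster labels, and the prompt layout are all assumptions made for illustration, and the clustering itself is replaced by a hand-written lookup table rather than a learned GUI Clustering Model.

```python
from collections import defaultdict


def build_transition_graph(steps, cluster_of):
    """Collapse a linear exploration sequence into a graph over screen clusters.

    steps:      list of (screen_id, action, next_screen_id) tuples (hypothetical format)
    cluster_of: dict mapping each screen_id to a cluster label; this stands in
                for the output of a clustering model and is an assumption here
    """
    nodes = set()
    edges = defaultdict(set)  # (src_cluster, dst_cluster) -> set of actions
    for src, action, dst in steps:
        a, b = cluster_of[src], cluster_of[dst]
        nodes.update((a, b))
        edges[(a, b)].add(action)
    # Sort for deterministic output
    return sorted(nodes), {pair: sorted(acts) for pair, acts in edges.items()}


def compose_prompt(nodes, edges, query):
    """Serialize nodes and edges into a text prompt (layout is illustrative)."""
    lines = ["Pages:"]
    lines += [f"- {n}" for n in nodes]
    lines.append("Transitions:")
    for (a, b), actions in sorted(edges.items()):
        for act in actions:
            lines.append(f"- {a} --[{act}]--> {b}")
    lines.append(f"Question: {query}")
    return "\n".join(lines)


# Toy exploration: four recorded screens that fall into two functional clusters.
steps = [
    ("s0", "tap Settings", "s1"),
    ("s1", "tap Back", "s2"),
    ("s2", "tap Settings", "s3"),
]
cluster_of = {"s0": "Home", "s1": "Settings", "s2": "Home", "s3": "Settings"}

nodes, edges = build_transition_graph(steps, cluster_of)
prompt = compose_prompt(nodes, edges, "Which action opens the Settings page?")
print(prompt)
```

In this toy run the four recorded screens reduce to two nodes, and the repeated "tap Settings" step is stored once, which is the sense in which clustering turns a linear sequence into a compact graph.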