Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He

TL;DR

This work tackles data scarcity for GUI agents by introducing a mid-training paradigm that leverages abundant, non-GUI data to build foundational capabilities for GUI planning and perception. The authors curate eleven mid-training datasets (seven multimodal, four textual) and show that mathematical reasoning data, even in text form, transfers strongly to GUI tasks, while GUI perception data yields comparatively smaller gains. They construct GUIMid, a mixed mid-training dataset, and demonstrate state-of-the-art results on AndroidWorld and WebArena when combined with GUI trajectory fine-tuning, showing robust cross-domain transfer and scaling benefits. The results provide actionable guidance for data-efficient GUI agent training and establish a practical framework for addressing trajectory data scarcity in this emerging field, with code, data, and models to be released.

Abstract

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.

Paper Structure

This paper contains 34 sections, 1 equation, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of the mid-training and fine-tuning process. Left: We first train the GUI agent on mid-training data, primarily from non-GUI domains, to investigate whether the enhanced capabilities can generalize to GUI agent tasks; Right: We perform post-training on GUI trajectory data.
  • Figure 2: A case illustrating the performance of the Model w/o Mid-Training and the Model w/ Mid-Training under the same task. The middle text shows the model’s thought process and the action taken, while the screenshots on the left and right represent the screen states before and after the action, respectively. The model with mid-training (bottom) successfully reflects on errors and generates correct actions from error states, while the model without mid-training (top) fails to recover from such states.
  • Figure 3: Performance of models trained on GUIMid with different scales.
  • Figure 4: Comparison of training loss between two training strategies: (a) and (c) show training with GUI trajectory data mixed into mid-training, while (b) and (d) show training without it.
  • Figure 5: The annotation UI for VisualWebArena.
  • ...and 6 more figures