HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen

Abstract

Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.

Paper Structure (22 sections, 10 equations, 12 figures, 7 tables, 1 algorithm)

Figures (12)

  • Figure 1: Overview of trajectory synthesis paradigms. Compared with (a) existing methods, (b) HATS integrates hardness-driven exploration and alignment-guided refinement in a closed loop, producing high-quality trajectories with rich semantic coverage and strong instruction-execution alignment. (c) Experiments show HATS outperforms OS-Genesis by 100%↑ on AndroidWorld (22.60 vs. 11.30) and 215%↑ on WebArena (20.60 vs. 6.53).
  • Figure 2: Illustrative cases of semantically ambiguous actions. Such actions constitute critical bottlenecks for robust agent generalization but are rarely captured by existing synthesis pipelines.
  • Figure 3: Architecture of the HATS framework. The framework integrates a Hardness-Driven Exploration module (§\ref{sec_module_explore}) and an Alignment-Guided Refinement module (§\ref{sec_module_alignment}) within a unified HD-MCTS loop. Exploration corresponds to the Selection, Expansion, and Simulation Phase I steps, while refinement handles Simulation Phase II and Backpropagation. Misalignment detected during refinement is converted into a hardness reward that guides subsequent exploration, forming a closed loop that progressively improves both the diversity and the semantic fidelity of synthesized trajectories.
  • Figure 4: Comparison of exploration strategies. Uniform Random Exploration [sun2024genesis] often yields trivial and redundant actions, whereas our Hardness-Driven Exploration replaces random walking with a hardness-driven policy that selectively targets under-represented yet semantically challenging actions.
  • Figure 5: Comparison of instruction synthesis methods. One-Shot Instruction Generation [sun2024genesis] directly maps raw traces to text, often yielding vague or underspecified goals and inconsistent executions. In contrast, our Multi-Round Alignment-Guided Refinement iteratively replays and verifies task instructions to produce semantically faithful and executable Verified Trajectories.
  • ...and 7 more figures
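
The relative improvements quoted in the Figure 1 caption can be checked from the scores it reports. The snippet below is a minimal arithmetic sketch, not code from the paper; the helper name `relative_gain` and the `scores` dictionary are illustrative assumptions, and only the four numbers come from the caption.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Scores quoted in the Figure 1 caption, as (HATS, OS-Genesis) pairs.
# The dictionary layout is a hypothetical convenience, not from the paper.
scores = {
    "AndroidWorld": (22.60, 11.30),
    "WebArena": (20.60, 6.53),
}

gains = {name: relative_gain(hats, base) for name, (hats, base) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

Rounded to whole percentages, the two gains match the 100% and 215% figures stated in the caption.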