
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang

Abstract

Autonomous mobile GUI agents have attracted increasing attention alongside the advancement of Multimodal Large Language Models (MLLMs). However, existing methods learn inefficiently from failed trajectories and suffer from ambiguous credit assignment under the sparse rewards of long-horizon GUI tasks. To address these issues, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and model in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.
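To make the first stage concrete, the following is a minimal sketch of one RFT round, assuming a rollout sampler, a rule-based verifier, and an SFT routine passed in as callables; all of these names and the sampling width `k` are illustrative assumptions, not the paper's actual interfaces.

```python
from typing import Callable, List, Tuple

# A trajectory is a list of (observation, action) steps; the string types
# here are illustrative stand-ins for screenshots and GUI actions.
Trajectory = List[Tuple[str, str]]


def rft_round(
    policy,                                        # current GUI agent policy
    tasks: List[str],                              # task instructions
    rollout: Callable[[object, str], Trajectory],  # samples one trajectory
    verify: Callable[[str, Trajectory], bool],     # rule-based success check
    finetune: Callable[[object, List[Trajectory]], object],  # SFT routine
    k: int = 8,                                    # rollouts per task (assumed)
):
    """One Rejection Fine-Tuning round: sample, filter, fine-tune."""
    accepted: List[Trajectory] = []
    for task in tasks:
        candidates = [rollout(policy, task) for _ in range(k)]
        # Rejection step: keep only trajectories the verifier accepts.
        accepted.extend(t for t in candidates if verify(task, t))
    # Supervised fine-tuning on self-generated, verified data yields the
    # next policy; iterating this loop lets data and model co-evolve.
    return finetune(policy, accepted)
```

Iterating `rft_round` over several rounds corresponds to the self-evolving loop described above: each improved policy generates higher-quality training data for the next round (cf. Figure 4, left).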

Paper Structure

This paper contains 35 sections, 7 equations, 9 figures, 2 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Performance comparison of various GUI agents on AndroidWorld. Our UI-Voyager (4B) achieves an 81.0% Pass@1 success rate, outperforming larger models and exceeding reported human-level performance.
  • Figure 2: The overall pipeline for training UI-Voyager on mobile GUI tasks. It consists of two iterative stages: (1) Rejection Fine-Tuning (RFT), where a base policy generates multiple trajectories that are filtered by a rule-based verifier to collect high-quality samples for supervised fine-tuning; (2) Group Relative Self-Distillation (GRSD), which identifies "fork points" between successful and failed trajectory groups using SSIM matching and corrects erroneous actions to further refine the policy $\pi_m$ through mixed-data training.
  • Figure 3: Illustration of the fork point detection strategy. Given a successful trajectory $\tau^+$ and a failed trajectory $\tau^-$ for the same task, the fork point detection mechanism identifies steps in the failed trajectory where the screen state matches that of a successful step ($\text{SAME}(o_i^+, o_j^-)$) but the subsequent action leads to divergence ($\text{DIVERGE}(i, j)$), indicating that the action taken in the failed trajectory deviates from the successful one. See Sec. \ref{sec:fpd} for details; a code sketch of this mechanism follows the figure list below.
  • Figure 4: RFT significantly boosts agent performance. Left: Pass@K performance across four iterative rounds of RFT. The results show consistent improvement in both Pass@1 and Pass@K as the self-evolution progresses. We select the checkpoint from the third RFT round (Pass@1 = 73.2%) for subsequent training. Right: Training curves of GRPO and PPO initialized from Qwen3-VL-4B-Instruct. The results show that directly deploying RL algorithms on the Qwen3-VL-4B-Instruct model yields marginal gains and exhibits high sample inefficiency.
  • Figure 5: Example of fork point detection on the BrowserMaze task. Both the failed and successful trajectories share the same screen state at Step 12 (the fork point). The failed trajectory takes an invalid "Right" action (blocked by a wall), while the successful trajectory takes the correct "Down" action. The fork point detection mechanism identifies this divergence and uses the correct action from the successful trajectory to supervise the failed one at this critical step.
  • ...and 4 more figures
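To make the SAME/DIVERGE predicates from the Figure 3 caption concrete, here is a minimal sketch of fork point detection and of turning detected forks into step-level supervision, assuming grayscale screenshots as NumPy float arrays in [0, 1] and using SSIM from scikit-image; the 0.95 threshold, the data layout, and all function names are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.95  # illustrative value, not the paper's setting


def same_state(obs_a: np.ndarray, obs_b: np.ndarray) -> bool:
    """SAME(o_i^+, o_j^-): two screens match if their SSIM is high enough."""
    return ssim(obs_a, obs_b, data_range=1.0) >= SSIM_THRESHOLD


def find_fork_points(success_traj, failed_traj):
    """Return (i, j, corrective_action) triples: step j of the failed
    trajectory shows the same screen as step i of the successful one,
    but the actions diverge (DIVERGE(i, j)). Each trajectory is a list
    of (screenshot, action) steps."""
    forks = []
    for i, (obs_s, act_s) in enumerate(success_traj):
        for j, (obs_f, act_f) in enumerate(failed_traj):
            if act_s != act_f and same_state(obs_s, obs_f):
                forks.append((i, j, act_s))
    return forks


def corrective_samples(task, success_traj, failed_traj):
    """Build dense step-level supervision: at each fork point, pair the
    failed trajectory's screen with the successful trajectory's action."""
    return [
        {"task": task, "observation": failed_traj[j][0], "action": act}
        for _, j, act in find_fork_points(success_traj, failed_traj)
    ]
```

Under these assumptions, such corrective pairs, mixed with data from successful trajectories, would form the mixed training set that refines the policy $\pi_m$ in the GRSD stage (Figure 2).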