MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents

Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, Yuxiao Dong

TL;DR

This work tackles the challenge of training mobile GUI agents through online agentic reinforcement learning, addressing sparse rewards, heavy-tailed task difficulty, and large-scale sampling bottlenecks. It introduces MobileRL, a framework that combines reasoning-free and reasoning supervised fine-tuning (SFT) with AdaGRPO, which integrates Shortest-Path Reward Adjustment, Difficulty-Adaptive Positive Replay, and Failure Curriculum Filtering to improve sample efficiency and stability. Empirical results on AndroidWorld and AndroidLab show state-of-the-art success rates with open backbones (e.g., GLM-4.1V-9B-Base achieving 80.2% on AndroidWorld and 53.6% on AndroidLab), and ablations confirm the value of each AdaGRPO component and of the reasoning SFT stages. The work also demonstrates scalable, reproducible training across hundreds of Android emulators, advancing practical deployment of autonomous mobile GUI agents and providing an open-source framework for future research.

Abstract

Building general-purpose graphical user interface (GUI) agents has become increasingly promising with the progress in vision-language models. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging due to the heavy-tailed distribution of task difficulty and the inefficiency of large-scale environment sampling. We present MobileRL, an online agentic reinforcement learning framework that enhances GUI agents in mobile environments. Its core component is the Difficulty-ADAptive GRPO (AdaGRPO) algorithm. In AdaGRPO, we design difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to different task difficulties. We also introduce a shortest-path reward adjustment strategy that reshapes rewards according to task length in multi-turn agentic tasks. Together, these strategies stabilize RL training, improve sample efficiency, and yield strong performance across diverse mobile apps and tasks. We apply MobileRL to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resulting MobileRL-9B model achieves state-of-the-art success rates on both AndroidWorld (80.2%) and AndroidLab (53.6%). The MobileRL framework is open-sourced at: https://github.com/THUDM/MobileRL.
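
For readers unfamiliar with GRPO, the base algorithm that AdaGRPO builds on, the following minimal Python sketch shows the group-relative advantage computation of standard GRPO only. It is not the MobileRL implementation and does not model any of the three AdaGRPO strategies; the function name, the binary rewards, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same task.

    Hypothetical helper: standard GRPO scores each rollout relative to the
    other rollouts sampled for the same prompt, instead of using a critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids 0/0
    return [(r - mu) / (sigma + eps) for r in rewards]


# One success among four rollouts: the success gets a positive advantage.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: all advantages are zero.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this sketch, a group in which every rollout fails yields all-zero advantages and therefore no learning signal, which is one way to see why sparse rewards and very hard tasks are difficult for group-based methods.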

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Left and Center: Task success rates on AndroidWorld (Rawles et al., 2024) and AndroidLab (Xu et al., 2024); hatched areas indicate gains from MobileRL on top of the SFT model. Right: Trajectory-level success rate curves on AndroidWorld train and test sets during RL training.
  • Figure 2: Overview of MobileRL. It consists of 1) reasoning warm-up with both reasoning-free SFT and reasoning SFT and 2) online agentic RL with AdaGRPO. In AdaGRPO, the warmed-up policy interacts with mobile environments to generate rollouts, which are scored by shortest-path reward adjustment (SPA). High-quality positive trajectories are stored in the difficulty-adaptive positive replay (AdaPR) buffer, while low-performing rollouts are pruned via failure curriculum filtering.
  • Figure 3: Ablation studies of the MobileRL framework and its AdaGRPO algorithm. We use the Reasoning SFT model with Qwen2.5-VL-7B-Instruct backbone for the ablation of the AdaGRPO algorithm. All test set results are averaged over three runs to mitigate the impact of randomness.
  • Figure 4: Pass@$k$ on AndroidWorld by task complexity level (Rawles et al., 2024). Pass@$k$ is the fraction of tasks solved within the top-$k$ attempts.
  • Figure 5: Win rate of MobileRL vs. MobileRL w/o SPA, where a win means completing a task with fewer steps. $n$ denotes the number of task templates per category. Categories: All (all templates); C1–C4 (complexity levels 1–4); BC/BW (both methods correct/wrong); Others (exactly one method correct).
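
The Pass@$k$ metric described in the Figure 4 caption can be illustrated with a small Python sketch. It assumes that "top-$k$ attempts" means the first $k$ recorded attempts per task; the function name, the task names, and the outcome data are hypothetical and are not taken from the paper. This is the plain empirical fraction, not the unbiased sampling-based estimator sometimes used for code generation benchmarks.

```python
def pass_at_k(outcomes, k):
    """Fraction of tasks with at least one success in the first k attempts.

    outcomes: dict mapping a task name to a list of booleans, one per
    attempt, in the order the attempts were made (illustrative format).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    solved = sum(1 for attempts in outcomes.values() if any(attempts[:k]))
    return solved / len(outcomes)


# Made-up outcomes for three tasks with three attempts each.
example_outcomes = {
    "task_a": [True, False, False],   # solved on the first attempt
    "task_b": [False, False, True],   # solved only on the third attempt
    "task_c": [False, False, False],  # never solved
}

pass_1 = pass_at_k(example_outcomes, 1)  # one of three tasks solved
pass_3 = pass_at_k(example_outcomes, 3)  # two of three tasks solved
```

Because each additional attempt can only add solved tasks, Pass@$k$ under this definition never decreases as $k$ grows.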