Generalization in Online Reinforcement Learning for Mobile Agents

Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang

TL;DR

Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, and that few-shot adaptation at test time improves performance on unseen apps, motivating future research in this direction.

Abstract

Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (https://github.com/zihuanjiang/AndroidWorld-Generalization).
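
As a rough illustration of the group-relative idea behind GRPO mentioned in the abstract, the sketch below normalizes each rollout's reward against the mean and standard deviation of its own group of rollouts for the same task. This is a minimal sketch, not the authors' implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per rollout, normalized within its group.

    `rewards` holds the scalar rewards of several rollouts sampled for the
    same task. `eps` (an assumed stabilizer) avoids division by zero when
    every rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption in this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with binary success rewards (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, while a group whose rollouts all succeed or all fail yields zero advantages and so contributes no learning signal under this normalization.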

Paper Structure

This paper contains 34 sections, 4 equations, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: Sample task instructions with corresponding screenshots from the train and test sets of the three unseen regimes in the AndroidWorld-Generalization benchmark. Red highlights the unseen scenarios: Instance, Template, and Application.
  • Figure 2: RL training system for mobile agents. We integrate GRPO with a scalable rollout collection system that parallelizes multiple environments. Docker containerization provides resource isolation and decouples the trainer from the environments through HTTP communication for reliability. Asynchronous rollouts eliminate synchronization bottlenecks, enabling more agent steps per unit time. Together, these three techniques facilitate reliable and efficient large-scale training.
  • Figure 3: Training dynamics on Unseen Instance with curriculum learning. Colored areas denote curriculum stages: blue (Easy), red (Easy + Medium), green (All). (Left) Average training and evaluation success rates. (Right) Average evaluation success rates by task type (Task Completion, Information Retrieval) and difficulty (Easy, Medium, Hard).
  • Figure 4: Training dynamics of GRPO across the three unseen regimes. We report training success rates and evaluation success rates with standard deviations.
  • Figure 5: Training dynamics of PPO across the three unseen regimes.
  • ...and 9 more figures
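
The asynchronous rollout collection described in the Figure 2 caption can be sketched as a pool of workers that each pull the next episode from a shared queue, so a slow environment does not stall the others. This is a hedged sketch under stated assumptions: `run_episode`, `collect_rollouts`, the random sleep (standing in for HTTP calls to a containerized environment), and the random reward are all hypothetical placeholders, not part of the released system.

```python
import asyncio
import random


async def run_episode(env_id: int, max_steps: int = 5) -> dict:
    """Simulate one episode in environment `env_id` (hypothetical stand-in)."""
    steps = 0
    for _ in range(max_steps):
        # Stand-in for one HTTP round trip to a containerized environment.
        await asyncio.sleep(random.uniform(0.001, 0.01))
        steps += 1
    reward = float(random.random() < 0.5)  # placeholder success signal
    return {"env_id": env_id, "steps": steps, "reward": reward}


async def collect_rollouts(num_envs: int, num_episodes: int) -> list:
    """Run `num_episodes` episodes across `num_envs` concurrent workers."""
    queue = asyncio.Queue()
    for episode_idx in range(num_episodes):
        queue.put_nowait(episode_idx)
    results = []

    async def worker(env_id: int) -> None:
        # Each worker takes the next pending episode as soon as it is free,
        # rather than waiting for all environments to finish a batch.
        while True:
            try:
                queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results.append(await run_episode(env_id))

    await asyncio.gather(*(worker(e) for e in range(num_envs)))
    return results


rollouts = asyncio.run(collect_rollouts(num_envs=4, num_episodes=10))
```

With a synchronous design, every batch would take as long as its slowest environment; here a free worker immediately starts the next pending episode, which is one way to obtain more agent steps per unit time.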