ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay

Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia

TL;DR

ARPO introduces an end-to-end policy optimization framework for vision-language GUI agents by augmenting GRPO with an experience replay buffer and a valuable-task selection strategy. The method enables scalable distributed rollouts in real desktop environments and leverages long-context multimodal reasoning to handle long-horizon GUI tasks. Empirical results on OSWorld show substantial in-domain gains over GRPO and offline preference baselines, with improved sample efficiency and stable training; generalization to out-of-domain tasks is more modest but improved with the replay mechanism. The work highlights the practicality of RL for multi-turn GUI control and provides a solid baseline for future exploration of longer context, richer reward signals, and broader task coverage in real-world UI interactions.

Abstract

Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge: optimizing long-horizon action sequences with multimodal feedback from complex environments. While recent work has advanced multi-turn reinforcement learning (RL) for reasoning and tool use in LLMs, its application to GUI-based agents remains relatively underexplored because of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse successful experience across training iterations. To further stabilize training, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Code and models: https://github.com/dvlab-research/ARPO.git.
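
The task selection idea in the abstract can be illustrated with a small sketch. The abstract only says that tasks are filtered by baseline agent performance; the concrete criterion used here (keep a task if the baseline succeeds on at least `min_successes` of its rollouts) and the names `select_valuable_tasks`, `baseline_rewards`, and `min_successes` are assumptions for illustration, not the authors' implementation.

```python
def select_valuable_tasks(baseline_rewards, min_successes=1):
    """Return the task ids worth training on.

    baseline_rewards maps a task id to the list of rewards the baseline
    agent obtained over several rollouts of that task (hypothetical format).
    Assumed criterion: keep a task only if the baseline succeeded
    (reward > 0) on at least `min_successes` rollouts, so the task can
    yield a non-zero learning signal early in training.
    """
    selected = []
    for task_id, rewards in baseline_rewards.items():
        successes = sum(1 for r in rewards if r > 0)
        if successes >= min_successes:
            selected.append(task_id)
    return selected


if __name__ == "__main__":
    rollouts = {
        "rename_file": [0, 0, 1, 0],   # occasionally solved: informative
        "edit_chart": [0, 0, 0, 0],    # never solved by the baseline
        "open_browser": [1, 1, 1, 1],  # always solved
    }
    print(select_valuable_tasks(rollouts))
```

Under this assumed criterion, a task the baseline never solves is dropped because every rollout would return zero reward; a stricter filter could also drop tasks that are always solved.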

Paper Structure

This paper contains 26 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Agentic Replay Policy Optimization (ARPO) enables effective end-to-end policy optimization for GUI agents. (a) Our vision-language agent processes long-horizon visual observations and interaction histories to generate sequential actions and receive policy gradients from sparse, delayed rewards. (b) ARPO significantly boosts in-domain task success rates compared to baseline and GRPO-only training. (c) Average training reward steadily increases, demonstrating improved policy learning and sample efficiency in complex GUI environments.
  • Figure 2: Illustration of the reinforcement learning procedure for our multi-turn GUI agent. For a single task, we run $n$ parallel environments and perform rollouts to collect trajectories and rewards $\{\tau_i, r_i\}_{i=0, 1, \cdots, n-1}$. If all the rewards are zero, we fetch a positive trajectory $\tau^+$ from the replay buffer to avoid vanishing gradients.
  • Figure 3: Ablation study of the replay buffer.
  • Figure 4: Ablation study for GRPO and ARPO in-domain and out-of-domain RL training tasks.
  • Figure 5: Training performance comparison for RL training on the selected task subset versus the full task set.
  • ...and 3 more figures
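
The replay step described in the Figure 2 caption can be sketched as follows. With group-relative advantages, a group whose rewards are all zero has zero mean and zero spread, so every advantage is zero and the policy gradient vanishes. Injecting one stored positive trajectory restores a non-zero signal. This is a minimal sketch under stated assumptions: the `ReplayBuffer` class, the `build_group` helper, the choice to overwrite the first rollout, and the use of the population standard deviation are all hypothetical, not taken from the paper.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - mean) / (std + eps).

    Assumption: population standard deviation over the rollout group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Per-task store of successful (trajectory, reward) pairs (hypothetical)."""

    def __init__(self, seed=0):
        self._store = {}
        self._rng = random.Random(seed)

    def add(self, task_id, trajectory, reward):
        self._store.setdefault(task_id, []).append((trajectory, reward))

    def sample(self, task_id):
        items = self._store.get(task_id)
        return self._rng.choice(items) if items else None


def build_group(task_id, trajectories, rewards, buffer):
    """Return (trajectories, rewards, advantages) for one task's rollouts."""
    trajectories, rewards = list(trajectories), list(rewards)

    # Cache any successful rollout for later reuse.
    for tau, r in zip(trajectories, rewards):
        if r > 0:
            buffer.add(task_id, tau, r)

    # All-zero group: swap in one stored positive trajectory, if available.
    if all(r == 0 for r in rewards):
        positive = buffer.sample(task_id)
        if positive is not None:
            trajectories[0], rewards[0] = positive  # assumed replacement rule

    return trajectories, rewards, group_advantages(rewards)


if __name__ == "__main__":
    buf = ReplayBuffer()
    buf.add("task-0", "tau_plus", 1.0)
    _, rs, adv = build_group("task-0", ["t0", "t1", "t2", "t3"], [0, 0, 0, 0], buf)
    print(rs, adv)
```

When the buffer holds no positive trajectory for the task, the group is left unchanged and all advantages remain zero, so in this sketch such a group contributes no gradient.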