ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents

Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang

TL;DR

ComputerRL trains end-to-end desktop agents by combining API-based control with GUI interactions, enabling scalable RL training across thousands of virtual desktops. It introduces Entropulse, which alternates RL with supervised fine-tuning to mitigate entropy collapse and rising KL divergence during extended training, and builds a large-scale Ubuntu-based infrastructure compatible with AgentBench. The paper reports state-of-the-art OS automation performance on OSWorld and OfficeWorld with AutoGLM-OS, demonstrating improved efficiency and generalization. These contributions offer a practical path toward robust, long-horizon autonomous desktop assistants in real-world settings.

Abstract

We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks; however, it remains challenging due to environmental inefficiency and instability during extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs. We employ ComputerRL on open models GLM-4-9B-0414 and GLM-4.1V-9B-Thinking, and evaluate them on the OSWorld benchmark. The AutoGLM-OS-9B achieves a new state-of-the-art accuracy of 48.9%, demonstrating significant improvements for general agents in desktop automation. Our code and the new OfficeWorld benchmark are available at https://github.com/thudm/ComputerRL. The algorithm and framework are adopted in building AutoGLM (Liu et al., 2024b).
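
To make "entropy collapse" concrete, the following minimal sketch computes the Shannon entropy of a policy's action distribution. It is an illustration only, not the authors' code: the function names and the two example distributions are hypothetical, and a real agent's distribution is over tokens rather than four discrete actions.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average entropy over a batch of action distributions."""
    return sum(policy_entropy(d) for d in distributions) / len(distributions)

# Hypothetical distributions over four candidate actions (illustrative values).
exploratory = [0.25, 0.25, 0.25, 0.25]  # uniform: maximal entropy, ln(4)
collapsed = [0.97, 0.01, 0.01, 0.01]    # near-deterministic: entropy close to 0

if __name__ == "__main__":
    print("exploratory:", policy_entropy(exploratory))
    print("collapsed:  ", policy_entropy(collapsed))
```

A policy whose mean entropy drifts toward zero over training has largely stopped exploring, which is the failure mode Entropulse is described as counteracting.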

Paper Structure

This paper contains 32 sections, 2 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: ComputerRL enables efficient end-to-end online policy optimization for OS agents. (a) On OSWorld (Xie et al., 2024), AutoGLM-OS, trained with ComputerRL, outperforms state-of-the-art agents. (b) Our Entropulse approach yields higher average training rewards and improves both learning efficiency and final performance over conventional methods.
  • Figure 2: Examples of AutoGLM-OS's execution on four user tasks, including image processing between GIMP and LibreOffice Writer, monitoring system resource usage in Terminal, table calculation in LibreOffice Calc, and document formatting in LibreOffice Writer.
  • Figure 3: Overview of ComputerRL framework. We introduce an API-GUI action paradigm that seamlessly integrates automatically constructed APIs with GUI actions to improve agent efficiency and effectiveness. A large-scale parallel desktop environment with 1,000+ real-world instances, combined with an asynchronous RL framework, enables efficient sampling and robust agent training.
  • Figure 4: Overview of ComputerRL, which includes three stages: (1) BC cold start with trajectories collected from general LLMs; (2) RL with step-level GRPO using verifiable, rule-based rewards; (3) Entropulse, which alternates RL with SFT on correct rollouts to restore entropy and sustain learning.
  • Figure 5: ComputerRL training curves of reward (left) and entropy (right) with 95% confidence intervals. The red line denotes the training with entropy recovery via Entropulse after the first RL stage, while the grey line denotes continued training with only reference resetting.
  • ...and 10 more figures
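
The Figure 4 caption mentions step-level GRPO with rule-based rewards, and SFT on correct rollouts. The sketch below shows one plausible reading under stated assumptions, not the paper's implementation: rewards are taken to be 1.0 for a verified success and 0.0 otherwise, advantages are normalized across the rollouts of one task (standard GRPO group normalization), and each step inherits its trajectory's advantage. The function names, the `(num_steps, reward)` representation, and the success threshold are hypothetical.

```python
from statistics import mean, pstdev

def step_level_advantages(rollouts, eps=1e-8):
    """Group-relative advantages, broadcast to every step of each rollout.

    rollouts: list of (num_steps, reward) pairs sampled for the same task.
    Returns one list of per-step advantages per rollout.
    Assumption: normalization is over trajectory rewards within the group.
    """
    rewards = [reward for _, reward in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards the all-equal case
    return [[(reward - mu) / (sigma + eps)] * num_steps
            for num_steps, reward in rollouts]

def select_sft_rollouts(rollouts, threshold=1.0):
    """Indices of rollouts whose rule-based reward marks them as correct."""
    return [i for i, (_, reward) in enumerate(rollouts) if reward >= threshold]

if __name__ == "__main__":
    # Hypothetical group: four rollouts of one task, two of them successful.
    group = [(3, 1.0), (2, 0.0), (4, 1.0), (1, 0.0)]
    print(step_level_advantages(group))
    print(select_sft_rollouts(group))
```

In this reading, successful rollouts push all of their steps up and failed ones push theirs down, while the selected indices identify the trajectories that an Entropulse-style SFT phase would train on.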