Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent

Yuhao Cheng; Liang Tang; Shuxian Li; Yukang Huo; Tiaonan Duan; Kaer Huang; Yanzhe Jing; Yiqiang Yan

TL;DR

The paper tackles autonomous computer use by introducing the Self-Evolution Agent (SEA), a compact 7B-parameter model that combines a closed-loop, verifiable data pipeline, step-wise reinforcement learning, and grounding-based generalization to outperform or rival much larger models on GUI-based tasks. Key innovations include a data engine that automatically generates verifiable trajectories (via a Task Agent and a Coding Agent); GATE trajectory extraction together with a step-wise TR-SRL training regime built on GRPO to address sparse rewards; and a grounding-enhanced architecture achieved through model merging (DARE) and a Temporal Compressed Sensing Mechanism to boost perception efficiency. On OSWorld and related benchmarks, SEA achieves state-of-the-art performance among 7B models and results competitive with larger models, with substantial gains in grounding accuracy and task success rate. The authors plan to open-source SEA and related code to support research and practical deployment in automated computer-use scenarios.

Abstract

Computer use agents represent an emerging area in artificial intelligence, aiming to operate computers autonomously to fulfill user tasks, attracting significant attention from both industry and academia. However, the performance of existing agents remains insufficient for practical deployment. In this paper, we propose the Self-Evolution Agent (SEA) for computer operation, alongside three core innovations in data generation, reinforcement learning, and model enhancement to develop this agent. Specifically, we first design an automatic pipeline to generate verifiable task trajectories for training. Second, we propose Efficient Step-wise Reinforcement Learning to reduce the substantial computational overhead of long-horizon training. Finally, we introduce a model enhancement method that integrates grounding and planning capabilities into a single model without additional training. Leveraging these innovations, our SEA (with only 7B parameters) outperforms existing models of the same parameter scale and achieves performance comparable to larger models (e.g., 32B/72B parameters) on computer use tasks. We plan to release the model weights and related code as open-source resources in the future.
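The training-free integration of grounding and planning is attributed in the TL;DR to DARE-style model merging. As a rough illustration only, the sketch below shows the generic drop-and-rescale idea on flat lists of floats: each fine-tuned model's delta from the shared base is randomly dropped with probability `drop_prob`, the surviving deltas are rescaled by `1 / (1 - drop_prob)`, and the results are summed onto the base. The function name `dare_merge`, its parameters, the toy weight vectors, and the plain summation of deltas are assumptions made for this example; the paper's actual merge configuration is not given in this section.

```python
import random


def dare_merge(base, finetuned_models, drop_prob=0.9, seed=0):
    """Toy drop-and-rescale merge (hypothetical helper, not the paper's code).

    base: list of floats, the shared pretrained weights.
    finetuned_models: list of weight lists, each fine-tuned from `base`.
    drop_prob: probability of dropping each delta entry (must be < 1).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    # Rescaling keeps the expected value of each delta unchanged.
    keep_scale = 1.0 / (1.0 - drop_prob)
    merged = list(base)
    for weights in finetuned_models:
        if len(weights) != len(base):
            raise ValueError("all models must share the base shape")
        for i, (w_base, w_ft) in enumerate(zip(base, weights)):
            delta = w_ft - w_base
            # Keep this delta entry with probability (1 - drop_prob).
            if rng.random() >= drop_prob:
                merged[i] += delta * keep_scale
    return merged


if __name__ == "__main__":
    base = [1.0, 2.0, 3.0, 4.0]
    grounding = [1.5, 2.0, 3.0, 4.5]  # assumed grounding-tuned weights
    planning = [1.0, 3.0, 2.5, 4.0]   # assumed planning-tuned weights
    print(dare_merge(base, [grounding, planning], drop_prob=0.5, seed=0))
```

With `drop_prob=0.0` the function reduces to adding every delta to the base, which is a convenient sanity check; real merges operate per tensor over full checkpoints and usually weight each model's contribution.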
