Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent

Yuhao Cheng, Liang Tang, Shuxian Li, Yukang Huo, Tiaonan Duan, Kaer Huang, Yanzhe Jing, Yiqiang Yan

TL;DR

The paper tackles the challenge of autonomous computer-use by introducing the Self-Evolution Agent (SEA), a compact $7\mathrm{B}$-parameter model that leverages a closed-loop, verifiable data pipeline, step-wise reinforcement learning, and grounding-based generalization to outperform or rival much larger models on GUI-based tasks. Key innovations include a data engine that auto-generates verifiable trajectories (via a Task Agent and a Coding Agent), the GATE trajectory extraction and a step-wise TR-SRL training regime built on GRPO to address sparse rewards, and a grounding-enhanced architecture achieved through model merging (DARE) and a Temporal Compressed Sensing Mechanism to boost perception efficiency. Empirical results on OSWorld and related benchmarks show SEA achieving state-of-the-art performance among 7B models and competitive results with larger models, with substantial gains demonstrated in grounding accuracy and task success rates. The work envisions open-sourcing SEA and related code to catalyze research and practical deployment in automated computer-use scenarios.

Abstract

Computer use agents represent an emerging area in artificial intelligence, aiming to operate computers autonomously to fulfill user tasks, attracting significant attention from both industry and academia. However, the performance of existing agents remains insufficient for practical deployment. In this paper, we propose the Self-Evolution Agent (SEA) for computer operation, alongside three core innovations in data generation, reinforcement learning, and model enhancement to develop this agent. Specifically, we first design an automatic pipeline to generate verifiable task trajectories for training. Second, we propose Efficient Step-wise Reinforcement Learning to reduce the substantial computational overhead of long-horizon training. Finally, we introduce a model enhancement method that integrates grounding and planning capabilities into a single model without additional training. Leveraging these innovations, our SEA (with only 7B parameters) outperforms existing models of the same parameter scale and achieves performance comparable to larger models (e.g., 32B/72B parameters) on computer use tasks. We plan to release the model weights and related code as open-source resources in the future.
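The TL;DR names DARE (Drop-And-REscale) as the merging technique behind the training-free integration of grounding and planning. The following is a minimal sketch of the general DARE idea on flat weight lists, not the authors' implementation: each fine-tuned model's delta from the shared base is randomly dropped with probability `drop_rate`, and the surviving entries are rescaled by `1 / (1 - drop_rate)` so the expected delta is preserved. The function names, the default drop rate, and the choice to combine the sparsified deltas by plain addition are all assumptions made for illustration.

```python
import random


def dare_delta(base, finetuned, drop_rate, rng):
    """Sparsify the delta (finetuned - base) by random drop and rescale."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_rate)  # keeps the expected delta unchanged
    deltas = []
    for b, f in zip(base, finetuned):
        if rng.random() < drop_rate:
            deltas.append(0.0)                   # dropped entry
        else:
            deltas.append((f - b) * keep_scale)  # kept entry, rescaled
    return deltas


def dare_merge(base, experts, drop_rate=0.9, seed=0):
    """Add the sparsified deltas of several experts onto the base weights.

    Hypothetical helper: summing the deltas is an assumed combination rule.
    """
    rng = random.Random(seed)
    merged = list(base)
    for expert in experts:
        for i, d in enumerate(dare_delta(base, expert, drop_rate, rng)):
            merged[i] += d
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0, 0.0]
    grounding = [0.2, 0.0, -0.1, 0.0]  # toy "grounding" expert weights
    planning = [0.0, 0.3, 0.0, 0.1]    # toy "planning" expert weights
    print(dare_merge(base, [grounding, planning], drop_rate=0.5, seed=1))
```

With `drop_rate=0.0` the sketch reduces to adding every expert's full delta to the base. No gradient step is involved at any point, which is the sense in which merging needs no additional training.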

Paper Structure

This paper contains 29 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison with SOTA methods. This scatter plot shows the OSWorld success rates of different open-source and closed-source models. Our proposed model achieves the highest success rate among open-source 7B models, substantially outperforms some open-source alternatives with more parameters, and comes close to the performance of certain closed-source models.
  • Figure 2: Illustration of the closed-loop task data generation pipeline. The pipeline consists of two core agents: (1) Task Agent: Generates task instructions (using few-shot examples of real software tasks) and checks executability/duplication; (2) Coding Agent: Takes task instructions and guidelines as input to synthesize Python-based execution programs (for task completion) and verification programs (for validating task success). Only tasks passing automatic execution and verification are retained.
  • Figure 3: Illustration of the data generation and multi-stage trajectory filtering strategy. Beginning with task instructions, execution programs, and verification programs, our SEA generates multiple trajectories. These trajectories then pass through rule-based selection and filtering by a step filter model. The filtered results are used to periodically replace and update our SEA for performance refinement.
  • Figure 4: Illustration of Trajectory Reasoning by Step-wise Reinforcement Learning. Step-wise RL allows an agent to act and obtain a reward at each step, from Step 0 to Step n, along a task trajectory. Each step comprises Input, Thought, Action, and Ground Truth components, with rewards calculated from four facets: Overall Reward, Consistency Reward, Format Reward, and Step Reward.
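
The step-wise reward in Figure 4 and the GRPO basis mentioned in the TL;DR can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulation: the captions do not say how the four reward facets are combined, so the weighted sum and its equal default weights are hypothetical, as are the names `StepReward`, `total_reward`, and `group_relative_advantages`. The advantage computation follows the usual GRPO recipe of normalizing each sampled response's reward against the mean and standard deviation of its group (population standard deviation is used here).

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class StepReward:
    """The four reward facets named in the Figure 4 caption."""
    overall: float
    consistency: float
    format: float
    step: float


def total_reward(r, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the facets with a weighted sum (assumed combination rule)."""
    w_overall, w_consistency, w_format, w_step = weights
    return (w_overall * r.overall + w_consistency * r.consistency
            + w_format * r.format + w_step * r.step)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy group: three candidate responses sampled for the same step.
    group = [
        StepReward(overall=1.0, consistency=1.0, format=1.0, step=1.0),
        StepReward(overall=0.0, consistency=1.0, format=1.0, step=0.0),
        StepReward(overall=0.0, consistency=0.0, format=0.0, step=0.0),
    ]
    rewards = [total_reward(r) for r in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because every step in a trajectory receives its own reward, each step can be scored and normalized within its own group of sampled responses, rather than waiting for a single success signal at the end of a long trajectory.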