SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang

TL;DR

SEAgent introduces an autonomous self-evolving framework for computer use agents that learns from experience without human annotations. It combines a World State Model for step-level trajectory judgment, a self-updating Curriculum Generator, and a reinforcement learning loop with adversarial imitation and Group Relative Policy Optimization, enabling both specialist and generalist CUAs. A specialist-to-generalist training strategy distills expert trajectories into a robust generalist that outperforms single-software specialists and prior RL baselines. Evaluations on OSWorld show substantial performance gains, illustrating the practicality of self-driven evolution for GUI-based software tasks. The approach highlights a path toward more versatile, autonomously improving agents in complex, real-world software environments.

Abstract

Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprising adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OSWorld. Our approach improves the success rate by 23.2 percentage points, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
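The abstract names GRPO but does not spell out how it scores actions. As a minimal sketch of the standard group-relative advantage at the core of GRPO, and not the authors' implementation, the snippet below normalizes the rewards of several actions sampled for the same step against their own group. The function name, the reward values, and the use of the population standard deviation are illustrative assumptions; SEAgent's actual step-level reward functions are not specified in this section.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled action relative to its own group.

    Standard GRPO-style normalization: subtract the group mean and
    divide by the group standard deviation. Population std is an
    assumption here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four actions sampled at the same step
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because advantages are computed within each group, no separate value network is needed: actions that beat their group's average get a positive signal and the rest a negative one.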

Paper Structure

This paper contains 40 sections, 6 equations, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: SEAgent enables computer use agents to self-evolve in novel environments by autonomously exploring and learning from their own experiences without human intervention. The specialist-to-generalist training strategy further enhances the development of a strong generalist agent.
  • Figure 2: SEAgent autonomous exploration and experiential learning pipeline. Guided by tasks generated by the Curriculum Generator, the Actor Model is updated according to step-level rewards from the World State Model through verifiable reward functions tailored for different action types.
  • Figure 3: Average Precision (AP) on AgentRewardBench (Lu et al., 2025). Compared with its base model, GUI-Judge shows improving AP as the number of input intermediate states increases, a trend similar to that of the closed-source GPT-4o (Hurst et al., 2024).
  • Figure 4: Self-evolved task instructions and success rate (SR) curves across different software. Tasks are progressively upgraded by the Curriculum Generator without human intervention, based on the evolving capabilities of the Actor Model at different training phases.
  • Figure 5: SEAgent autonomous exploration pipeline. The agent (policy model) and World State Model iteratively generate new tasks and perform RL to become a specialist in novel software.
  • ...and 7 more figures