
UI-Evol: Automatic Knowledge Evolving for Computer Use Agents

Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu

TL;DR

UI-Evol addresses the gap between externally retrieved GUI knowledge and real task execution by introducing a two-stage knowledge evolution pipeline: Retrace extracts faithful objective action sequences from actual agent interactions, and Critique analyzes deviations against external references to refine knowledge. Experiments on the OSWorld benchmark with Agent S2 show that UI-Evol improves task success rates and substantially reduces behavioral variance, addressing instability in computer-use agents. The approach demonstrates that evolved GUI knowledge better aligns with real-world environments and remains transferable across models. The work provides a practical, plug-and-play module for robust GUI-based task automation and improved reliability.

Abstract

External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.

Paper Structure

This paper contains 24 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The green box shows web-retrieved task knowledge, while the yellow box shows evolved knowledge from our approach. Web knowledge is generally correct but often lacks practical details (left) or suggests more complex manipulations than necessary (right).
  • Figure 2: UI-Evol consists of two stages: Retrace replays screenshots to recover objective actions; Critique uses web knowledge to detect deviations, explore alternatives, and output rationale-backed fixes that are fed back into the knowledge base.
  • Figure 3: Case study on the "capitalize every word" task from the OSWorld benchmark. UI-Evol first retraces the objective action sequence from the screenshots and identifies that the action taken at step $t$ selected only part of the paragraph. In the Critique Stage, it detects that this action deviates from the objective, since the entire document should have been selected rather than only part of it. Finally, the framework corrects this deviation by proposing a simpler keyboard shortcut, "Ctrl + A", instead of dragging with the mouse.
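
The two stages described in the Figure 2 and Figure 3 captions can be illustrated with a minimal Python sketch. This is not the authors' implementation. In the paper, both stages work from screenshots and web knowledge; here each step is already a text string and the comparison is plain string equality, both of which are simplifying assumptions. The names `Step`, `Deviation`, `retrace`, `critique`, and `evolve`, as well as the toy strings in the usage example, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One retraced step: what the agent actually did, as text."""
    index: int
    action: str


@dataclass
class Deviation:
    """A step where the observed action differs from the reference."""
    step: int
    observed: str
    expected: str
    fix: str


def retrace(observations: List[str]) -> List[Step]:
    """Retrace stage (stub): normalize per-step observations into actions.

    Assumption: observations are already text. The paper recovers them
    from screenshots, which this sketch does not attempt.
    """
    return [Step(i, obs.strip().lower()) for i, obs in enumerate(observations)]


def critique(trace: List[Step], reference: List[str],
             fixes: Dict[str, str]) -> List[Deviation]:
    """Critique stage (stub): compare the trace to a reference, step by step.

    Assumption: exact string match decides whether a step deviates.
    `fixes` maps a normalized reference step to a suggested correction.
    """
    deviations = []
    for step, expected in zip(trace, reference):
        expected_norm = expected.strip().lower()
        if step.action != expected_norm:
            fix = fixes.get(expected_norm, f"prefer: {expected_norm}")
            deviations.append(
                Deviation(step.index, step.action, expected_norm, fix))
    return deviations


def evolve(knowledge: List[str], deviations: List[Deviation]) -> List[str]:
    """Feed fixes back into the knowledge base, skipping duplicates."""
    evolved = list(knowledge)
    for d in deviations:
        if d.fix not in evolved:
            evolved.append(d.fix)
    return evolved


if __name__ == "__main__":
    # Toy data loosely modeled on the "capitalize every word" case study.
    observed = ["Drag to select part of the paragraph",
                "Apply the capitalize command"]
    reference = ["Select the entire document",
                 "Apply the capitalize command"]
    fixes = {"select the entire document":
             "press Ctrl+A to select the entire document"}

    found = critique(retrace(observed), reference, fixes)
    print(evolve(["select the text, then capitalize"], found))
```

Keeping `evolve` separate from `critique` mirrors the feedback loop in Figure 2: the critique only produces findings, and a distinct step decides what is written back to the knowledge base.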