OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, Heng Ji

TL;DR

We introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm that comprehensively explores and verifies an environment's unit functions, helping environment-learned agents take a meaningful step toward expert-level computer use.

Abstract

General-purpose computer-use agents have shown impressive performance across diverse digital environments. However, our new benchmark, OSExpert-Eval, indicates that they remain far less helpful than human experts. Although inference-time scaling enables adaptation, these agents complete complex tasks inefficiently and with degraded performance, transfer poorly to unseen UIs, and struggle with fine-grained action sequences. To address these problems, we introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm that comprehensively explores and verifies an environment's unit functions. The agent then exploits compositionality between unit skills to self-construct a curriculum for composite tasks. To support fine-grained actions, we curate a database of action primitives for agents to discover during exploration; these are saved as a skill set once exploration is complete. We use the learned skills to improve the agent's performance and efficiency by (1) enriching agents with ready-to-use procedural knowledge, allowing them to plan only once for long trajectories and generate accurate actions, and (2) enabling them to end inference-time scaling earlier by recognizing the boundary of their capabilities. Extensive experiments show that our environment-learned agent takes a meaningful step toward expert-level computer use, achieving a performance gain of around 20 percent on OSExpert-Eval and closing the efficiency gap to humans by around 80 percent.
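The exploration idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain depth-first search over a toy state graph, and every name in it (`ToyGUI`, `actions`, `verify`, `gui_dfs`, the state and element labels) is a hypothetical stand-in chosen for illustration. It assumes that a state with no further clickable elements is a terminal state representing one unit function, and that verification is a simple yes/no check.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGUI:
    """Hypothetical stand-in for a GUI: maps each state to {element: next_state}."""
    transitions: dict
    broken: set = field(default_factory=set)  # terminal states that fail verification

    def actions(self, state):
        # Clickable elements available in this state (empty dict = terminal state).
        return self.transitions.get(state, {})

    def verify(self, state):
        # Assumed yes/no check that the unit function reached at `state` worked.
        return state not in self.broken


def gui_dfs(env, root, max_depth=5):
    """Depth-first exploration; returns (verified skills, failed terminal states)."""
    skills, failed, visited = {}, [], set()

    def visit(state, path, depth):
        if state in visited or depth > max_depth:
            return
        visited.add(state)
        available = env.actions(state)
        if not available:
            # Terminal state: treat it as a candidate unit function.
            if env.verify(state):
                skills[state] = list(path)  # store the click sequence as a skill
            else:
                failed.append(state)
            return
        for element, next_state in available.items():
            visit(next_state, path + [element], depth + 1)

    visit(root, [], 0)
    return skills, failed


demo = ToyGUI(
    transitions={
        "main": {"Format": "format_menu", "Insert": "insert_menu"},
        "format_menu": {"Bold": "bold_applied", "Italic": "italic_applied"},
        "insert_menu": {"Image": "image_dialog"},
        "image_dialog": {"OK": "image_inserted"},
    },
    broken={"italic_applied"},
)
skills, failed = gui_dfs(demo, "main")
print(skills)
print(failed)
```

In this toy run, each verified terminal state is stored with the click path that reached it, which mirrors the idea of saving explored unit functions as a reusable skill set; the terminal state that fails verification is recorded separately rather than saved as a skill.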

Paper Structure

This paper contains 39 sections, 1 equation, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Our OSExpert-Eval shows that current computer-use agents remain far from expert-level: they struggle with long-horizon tasks, generalize poorly to unseen UI designs, lack fine-grained control over action sequences, and still fall well short of human expert efficiency.
  • Figure 2: Top: Current general-purpose computer-use agents rely on inference-time scaling, yet remain prone to failures and high latency. Left: Prior approaches explore digital environments using human-curated or tutorial-derived queries, which are often unavailable or difficult to obtain for arbitrary environments. Right: Our framework requires no external data or human effort for exploration queries, discovers the unit functions of the digital environment more comprehensively, and benefits both performance and efficiency. Figure 3 shows how we handle fine-grained actions during exploration and how we organize the learned skill set.
  • Figure 3: Left: How our framework organizes and utilizes the self-constructed skill set for robust and efficient inference. The unit functions are obtained from the terminal states, as shown in Figure 2. Right: How we handle potential fine-grained actions during the exploration stage. Fine-grained action handling is usually triggered by an error state during exploration, as shown in Figure 2. The selected fine-grained action primitive template is added to the skill set for solving future queries if it is verified to be helpful.
  • Figure 4: Composition of our evaluation tasks (113 total). The inner ring shows the three high-level categories (Unseen UI, Fine-Grained, and Long Horizon), while the outer ring breaks each category down by environment (Tableau, MiniWord, Office, and GIMP); slice sizes are proportional to the number of tasks. Here, Office includes LibreOffice Writer, LibreOffice Impress, and LibreOffice Calc.
  • Figure 5: Representative examples from OSExpert-Eval across three task categories. The figure aggregates six examples illustrating the breadth of professional computer-use skills evaluated in our benchmark. Top (Long-Horizon Compositional Workflows): multi-step tasks in LibreOffice Calc and Writer that require composing several unit operations in the correct order, including spreadsheet completion with interface adjustments (e.g., zoom) and document-wide formatting with image insertion. Middle (Fine-Grained Action Execution): precise image-editing tasks in GIMP, including tightly cropping to a specified 200×200 region and performing accurate background removal while preserving object integrity. Bottom (Unseen UI Generalization): tasks in Tableau and MiniWord that test transfer to unfamiliar interfaces and interaction patterns, such as building a world map visualization with sales-based coloring and category filtering, and importing external text followed by image insertion and alignment in a novel editor layout. For each example, we show the initial environment state, the natural-language instruction, and the corresponding ground-truth outcome.