Computer-Using World Model

Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, Pu Zhao, Lukas Wutschitz, Samuel Kessler, Huseyin A Inan, Robert Sim, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

TL;DR

This paper introduces the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action, and shows that world-model-guided test-time scaling improves decision quality and execution robustness on Office tasks.

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
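
The two-stage prediction and the test-time action search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `UIState`, `simulate`, `select_action`, and the three `toy_*` functions are hypothetical names, the UI state is reduced to a text summary instead of a screenshot, and the keyword-overlap scorer is a stand-in for whatever goal-alignment judgment the frozen agent actually makes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class UIState:
    """Hypothetical stand-in for a screenshot: a plain-text UI summary."""
    summary: str


# Assumed interfaces for the two stages and for the agent's scoring step.
TextStage = Callable[[UIState, str], str]        # (state, action) -> change text
RenderStage = Callable[[UIState, str], UIState]  # (state, change text) -> next state
Scorer = Callable[[UIState, str], float]         # (predicted state, goal) -> score


def simulate(state: UIState, action: str,
             predict_change: TextStage, render: RenderStage) -> UIState:
    """Two-stage rollout: describe the change in text, then realize it."""
    change = predict_change(state, action)  # stage 1: textual transition
    return render(state, change)            # stage 2: next UI state


def select_action(state: UIState, goal: str, candidates: Sequence[str],
                  predict_change: TextStage, render: RenderStage,
                  score: Scorer) -> str:
    """Simulate every candidate and return the one whose predicted state scores best."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(
        candidates,
        key=lambda a: score(simulate(state, a, predict_change, render), goal),
    )


# Toy stand-ins so the sketch runs; a real system would call learned models here.
def toy_predict_change(state: UIState, action: str) -> str:
    return f"the '{action}' action was applied"


def toy_render(state: UIState, change: str) -> UIState:
    return UIState(f"{state.summary}; {change}")


def toy_score(state: UIState, goal: str) -> float:
    # Fraction of goal words that appear in the predicted state summary.
    words = goal.lower().split()
    if not words:
        return 0.0
    text = state.summary.lower()
    return sum(w in text for w in words) / len(words)


start = UIState("Excel, Review tab selected")
best = select_action(
    start,
    "protect workbook",
    ["click Protect Sheet", "click Protect Workbook", "click Share"],
    toy_predict_change, toy_render, toy_score,
)
print(best)  # prints: click Protect Workbook
```

Nothing is executed in the real application during the search: each candidate is only rolled out through the model, and the agent commits to a single action afterwards.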

Paper Structure

This paper contains 44 sections, 17 equations, 7 figures, and 13 tables.

Figures (7)

  • Figure 1: UI state transitions generated by CUWM. Each row is one example transition.
  • Figure 2: Overview of CUWM. World-model state transitions proceed in two stages. In the first stage, given the current UI state and an action, the world model predicts a textual description of the transition to the next state. In the second stage, the world model conditions on the current UI state and the transition description to render the next UI state.
  • Figure 3: Qualitative comparison of CUWM predictions and ground truth under representative UI actions, showing close alignment in both layout and panel states.
  • Figure 4: World-model-guided action selection. Given the current Excel UI state and a set of candidate actions, CUWM correctly simulates the next state for each action, guiding the agent to select “Protect Workbook” based on goal alignment.
  • Figure 5: Training curve over epochs for Text Perception Score ($\uparrow$).
  • ...and 2 more figures