Computer-Using World Model

Yiming Guan; Rui Yu; John Zhang; Lu Wang; Chaoyun Zhang; Liqun Li; Bo Qiao; Si Qin; He Huang; Fangkai Yang; Pu Zhao; Lukas Wutschitz; Samuel Kessler; Huseyin A Inan; Robert Sim; Saravan Rajmohan; Qingwei Lin; Dongmei Zhang

TL;DR

This paper introduces the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state from the current state and a candidate action, and shows that world-model-guided test-time scaling improves decision quality and execution robustness.

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
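The two-stage prediction and test-time action search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `UIState`, `simulate`, `select_action`, `predict_text`, `render`, and `score` are all hypothetical, the UI state is reduced to a string instead of a screenshot, and the toy stand-ins at the bottom replace the trained models and the frozen agent's judgment.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class UIState:
    """Hypothetical stand-in for a UI state (a real system would hold a screenshot)."""
    description: str


# Hypothetical type aliases for the two world-model stages and the scorer.
TextPredictor = Callable[[UIState, str], str]   # stage 1: state + action -> textual change
Renderer = Callable[[UIState, str], UIState]    # stage 2: state + textual change -> next state
Scorer = Callable[[UIState], float]             # how well a predicted state matches the goal


def simulate(state: UIState, action: str,
             predict_text: TextPredictor, render: Renderer) -> UIState:
    """Predict the next UI state in two stages: textual change first, then rendering."""
    change = predict_text(state, action)
    return render(state, change)


def select_action(state: UIState, candidates: Sequence[str],
                  predict_text: TextPredictor, render: Renderer,
                  score: Scorer) -> str:
    """Simulate every candidate action and return the one with the best predicted outcome."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(
        candidates,
        key=lambda action: score(simulate(state, action, predict_text, render)),
    )


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_predict_text(state: UIState, action: str) -> str:
    return f"'{action}' dialog opens"


def toy_render(state: UIState, change: str) -> UIState:
    return UIState(f"{state.description}; {change}")


def toy_score(predicted: UIState) -> float:
    # Pretend the goal is to protect the workbook.
    return 1.0 if "Protect Workbook" in predicted.description else 0.0


current = UIState("Excel, Review tab selected")
chosen = select_action(
    current,
    ["Share Workbook", "Protect Workbook", "New Comment"],
    toy_predict_text, toy_render, toy_score,
)
print(chosen)
```

In this sketch nothing is executed against the real application during the search: only the chosen action would be sent to the environment, which mirrors the motivation that real execution does not support counterfactual exploration.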

Paper Structure (44 sections, 17 equations, 7 figures, 13 tables)

Introduction
Related Work
Method
Two-Stage World Model Architecture
Supervised Training with GPT-Annotated Transitions
Structure-Aware Reinforcement Learning for Textual Transitions
World-Model-Guided Test-Time Action Search
Experiments
Experimental Setup
World Model Fidelity
Textual State Transition Evaluation
Visual State Realization Evaluation
World-Model-Guided Test-Time Action Search
Case Study
Insights: How World Models help GUI agents
...and 29 more sections

Figures (7)

Figure 1: UI state transitions generated by CUWM. Each row is one example transition.
Figure 2: Overview of CUWM. The world model's state transition proceeds in two stages. In the first stage, given the current UI state and an action, the world model predicts a textual description of the transition to the next state. In the second stage, the world model conditions on the current UI state and the transition description to render the next UI state.
Figure 3: Qualitative comparison of CUWM predictions and ground truth under representative UI actions, showing close alignment in both layout and panel states.
Figure 4: World-model-guided action selection. Given the current Excel UI state and a set of candidate actions, CUWM correctly simulates the next state for each action, guiding the agent to select “Protect Workbook” based on goal alignment.
Figure 5: Training curve over epochs for Text Perception Score ($\uparrow$).
...and 2 more figures
