R-WoM: Retrieval-augmented World Model For Computer-use Agents

Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang

TL;DR

The paper addresses the challenge of using large language models as world models for computer-use agents, showing that while LLMs capture short-term state changes, they struggle with long-horizon planning due to hallucination and stale knowledge. It introduces R-WoM, a retrieval-augmented framework that grounds LLM simulations with environment-specific tutorials via a reasoning-based retrieval pipeline, long-chain-of-thought rollouts, and listwise reward ranking. Empirical results show improvements over baselines of up to 25.3% on OSWorld and 18.1% on WebArena, particularly for longer-horizon tasks, supporting grounding as a key lever for stable, reliable imagined dynamics. The work highlights practical implications for deploying LLM-based world models in dynamic GUI and browser environments and lays groundwork for tutorial synthesis and efficiency improvements.

Abstract

Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models (future state prediction and reward estimation) through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.

Paper Structure

This paper contains 27 sections, 11 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Example task: "Copy the screenshot 1.png from the desktop to where my cursor is located." Left: Using only internal world knowledge, the agent loses cursor location and gets stuck. Right: With grounded world knowledge from tutorials, the agent uses the correct "Insert Image" operation while maintaining cursor position. This illustrates how grounding with external knowledge enables more reliable decision-making in realistic environments.
  • Figure 2: Overview of the R-WoM pipeline. At each time step $i$, the policy model generates $m$ candidate actions. For each candidate, the world model, grounded by retrieved tutorials, performs a $k$-step rollout to simulate a possible future trajectory. Finally, the world model estimates the rewards of the rollout trajectories to select the best action.
  • Figure 3: Performance under different grounding settings. We compare an ungrounded world model (WoM), a world model grounded with retrieved tutorials (R-WoM), and a world model grounded with oracle tutorials (R-WoM (oracle)).
  • Figure 4: Success rates (%) across imagination horizons on OSWorld (a) and WebArena (b). R-WoM (green, solid) consistently outperforms WoM (red, dashed) and peaks at a larger imagination horizon (around 3), indicating that grounding benefits world models in simulations over longer horizons.
  • Figure 5: Illustration of the next-state identification probing task. Given a current state and an action, the model must choose between two candidate next states: (A) the ground-truth state, and (B) a lexically similar distractor. This task evaluates whether the world model can correctly predict the true next observation rather than being misled by textual similarity.
  • ...and 3 more figures
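
The action-selection loop described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `select_action`, `propose`, `rollout`, and `rank` are hypothetical, and the toy stand-ins below replace what the paper describes as LLM-based components (policy model, tutorial-grounded world model, listwise reward ranking) with simple string operations.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for the three components named in the caption.
Propose = Callable[[str, int], List[str]]                      # policy model
Rollout = Callable[[str, str, Sequence[str], int], List[str]]  # world model
Rank = Callable[[List[List[str]], Sequence[str]], List[int]]   # reward ranking


def select_action(state: str, propose: Propose, rollout: Rollout, rank: Rank,
                  tutorials: Sequence[str], m: int = 3, k: int = 3) -> str:
    """Pick one action at the current time step.

    1. The policy proposes m candidate actions.
    2. The world model simulates a k-step rollout per candidate,
       conditioned on the retrieved tutorials.
    3. The rollouts are ranked jointly and the best candidate is returned.
    """
    candidates = propose(state, m)
    trajectories = [rollout(state, action, tutorials, k) for action in candidates]
    order = rank(trajectories, tutorials)  # candidate indices, best first
    return candidates[order[0]]


# Toy stand-ins (assumptions for illustration only, no LLM involved).
def toy_propose(state: str, m: int) -> List[str]:
    return [f"action_{i}" for i in range(m)]


def toy_rollout(state: str, action: str, tutorials: Sequence[str], k: int) -> List[str]:
    # Each imagined step is just a text description of the simulated state.
    return [f"{state}->{action}@step{t}" for t in range(1, k + 1)]


def toy_rank(trajectories: List[List[str]], tutorials: Sequence[str]) -> List[int]:
    # Score a trajectory by how many of its steps mention tutorial text.
    scores = [sum(any(t in step for t in tutorials) for step in traj)
              for traj in trajectories]
    return sorted(range(len(trajectories)), key=lambda i: -scores[i])


best = select_action("s0", toy_propose, toy_rollout, toy_rank,
                     tutorials=["action_1"], m=3, k=2)
```

In this sketch `rank` sees all rollouts at once and returns an ordering, mirroring the listwise ranking mentioned in the TL;DR rather than scoring each trajectory in isolation; how the actual system prompts the model for that ranking is not specified here.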