From Word to World: Can Large Language Models be Implicit Text-based World Models?

Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji, Mengdi Wang

TL;DR

This work investigates whether large language models can serve as implicit text-based world simulators to enhance agent learning from interaction. By formalizing world modeling as multi-turn next-state prediction and evaluating across five diverse text environments, the study demonstrates that sufficiently trained LLMs can maintain coherent latent dynamics, scale with data and capacity, and improve downstream learning via verification, synthetic data, and warm-started RL. However, gains depend on behavioral coverage and environment complexity, limiting effectiveness in open-ended settings without grounding in real observations. The results establish a foundation for treating LLMs as general-purpose simulators of interactive worlds and suggest directions toward multimodal extensions beyond text.

Abstract

Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundaries on when world modeling effectively supports agent learning.
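To make the "next-state prediction under interaction" framing concrete, the following is a minimal sketch, not the paper's implementation. It assumes a world model can be treated as a function from (initial state, interaction history, action) to a predicted next-state string, and that fidelity is scored by exact string match against recorded states. The names `WorldModel`, `next_state_accuracy`, and `toy_model`, the drawer scenario, and the exact-match criterion are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a turn is (action, resulting state text).
Turn = Tuple[str, str]
# Hypothetical interface: (initial state, history, action) -> predicted next state.
WorldModel = Callable[[str, List[Turn], str], str]


def next_state_accuracy(model: WorldModel, initial: str, episode: List[Turn]) -> float:
    """Fraction of turns whose predicted next state exactly matches the recorded one.

    Assumption: exact string match is used as the fidelity criterion.
    """
    history: List[Turn] = []
    correct = 0
    for action, true_state in episode:
        predicted = model(initial, history, action)
        correct += int(predicted.strip() == true_state.strip())
        # Condition later predictions on the recorded (real) state, not the prediction.
        history.append((action, true_state))
    return correct / len(episode) if episode else 0.0


def toy_model(initial: str, history: List[Turn], action: str) -> str:
    """Stand-in for an LLM: it tracks only whether the drawer was ever opened."""
    opened = any(past_action == "open drawer" for past_action, _ in history)
    if action == "open drawer":
        return "The drawer is open."
    if action == "look":
        return "The drawer is open." if opened else "The drawer is closed."
    return "Nothing happens."  # the toy model does not know any other action


episode = [
    ("look", "The drawer is closed."),
    ("open drawer", "The drawer is open."),
    ("look", "The drawer is open."),
    ("close drawer", "The drawer is closed."),
]
accuracy = next_state_accuracy(toy_model, "The drawer is closed.", episode)
print(accuracy)
```

The toy model answers the first three turns correctly because it keeps track of the drawer across turns, and it fails on the action it never learned. This is a small-scale picture of how limited behavioral coverage can cap world-model fidelity.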

Paper Structure

This paper contains 47 sections, 5 equations, 23 figures, 10 tables.

Figures (23)

  • Figure 1: LLMs as text-based world models for agent learning. (A) We formulate world modeling as next-state prediction under a fixed text-based interaction protocol. (B) We assess world-model capability along three axes: fidelity/consistency, scalability/robustness, and agent utility. (C) The world model exhibits high fidelity and consistency in both single-step predictions and long-horizon rollouts. (D) Performance scales predictably with increased training data across text environments. (E) Faithful world models enhance agents via verification, synthetic data generation, and improved reinforcement learning through stronger initialization.
  • Figure 2: Next-state prediction accuracy under varying training data sizes on Qwen2.5-7B. Structured settings saturate with modest data (20K), whereas open-ended settings continue to benefit from larger datasets. Note: we apply a nonlinear y-axis transform $f(y) = 100 - 20 \log_{10}(\max(100 - y, 0.01) + 1)$ to better reveal growth trends.
  • Figure 3: Next-state prediction accuracy on the Qwen2.5 family. Smaller models (1.5B) capture structured dynamics effectively, whereas more complex settings benefit markedly from increased model capacity.
  • Figure 4: Task success rate (%) in ALFWorld under different OOD settings. Success rates are averaged over different agents, with full results provided in the OOD generalization table of the full-results appendix. World models maintain strong performance even when layouts or room types change.
  • Figure 5: Next-state prediction accuracy under mixed and separate training on Qwen2.5-7B, with 1K samples per environment. We begin by mixing structured environments (ALFWorld, SciWorld, TextWorld) and then progressively incorporate open-ended environments (WebShop, StableToolBench), yielding the Mix3, Mix4, and Mix5 settings.
  • ...and 18 more figures
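
The y-axis transform quoted in the Figure 2 caption, $f(y) = 100 - 20 \log_{10}(\max(100 - y, 0.01) + 1)$, can be evaluated directly to see how it reshapes the axis. The sketch below implements only that formula; the function name `stretch_accuracy` and the sample accuracy values are illustrative assumptions, not taken from the paper.

```python
import math


def stretch_accuracy(y: float) -> float:
    """Apply f(y) = 100 - 20 * log10(max(100 - y, 0.01) + 1) from the Figure 2 caption.

    y is an accuracy in percent. The max(..., 0.01) floor clamps the gap to 100,
    so every y >= 99.99 maps to the same value.
    """
    gap = max(100.0 - y, 0.01)
    return 100.0 - 20.0 * math.log10(gap + 1.0)


# Hypothetical sample accuracies, chosen only to show the shape of the transform.
samples = [0.0, 50.0, 90.0, 99.0, 100.0]
transformed = [stretch_accuracy(y) for y in samples]
for y, fy in zip(samples, transformed):
    print(f"{y:6.2f} -> {fy:6.2f}")
```

Under this formula the step from 90 to 99 takes up more vertical space than the step from 0 to 50, which is why the transform makes differences near saturation easier to see.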