Reinforcement World Model Learning for LLM-based Agents

Xiao Yu; Baolin Peng; Ruize Xu; Yelong Shen; Pengcheng He; Suman Nath; Nikhil Singh; Jiangfeng Gao; Zhou Yu

TL;DR

RWML introduces a self-supervised method to endow LLM-based agents with robust world knowledge by learning action-conditioned transitions through sim-to-real alignment in a pretrained embedding space. By optimizing a cosine-similarity reward between predicted and observed next states via GRPO, RWML avoids brittle next-token prediction and scales without expert data. Empirical results on ALFWorld and $\tau^2$ Bench show substantial performance gains, and when combined with task-success RL, RWML surpasses baselines and matches expert-data training, while exhibiting less forgetting and more stable parameter updates. The approach demonstrates that strengthening internal world models prior to policy RL can markedly improve long-horizon decision-making in agentic settings. This work paves the way for scalable, self-supervised pretraining of environment understanding to enhance LLM-driven agents in complex tasks.

Abstract

Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $\tau^2$ Bench respectively, while matching the performance of expert-data training.
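The reward described in the abstract can be illustrated with a minimal sketch. It scores a predicted next state against the observed one by cosine similarity in an embedding space, then converts a group of such rewards into group-relative advantages in the style of GRPO. The function names are hypothetical, and `toy_embed` is only a bag-of-words stand-in for the pre-trained embedding model used in the paper, which this section does not name. The normalization shown is the commonly used mean/std form and may differ in detail from the authors' training setup.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a pre-trained embedding model (assumption):
    a sparse bag-of-words count vector over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty state carries no signal
    return dot / (norm_u * norm_v)


def sim_to_real_reward(predicted_next_state, real_next_state, embed=toy_embed):
    """Reward is high when the simulated next state is close to the real one."""
    return cosine_similarity(embed(predicted_next_state), embed(real_next_state))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

In use, several candidate predictions of $\hat{s}_{t+1}$ would be sampled for the same $\left\langle s_{\le t}, a_t \right\rangle$ input, each scored with `sim_to_real_reward` against the real $s_{t+1}$, and the resulting list passed to `group_relative_advantages`. Because the score depends on embedding similarity rather than exact wording, paraphrased but semantically equivalent predictions are not penalized as heavily as under token-level matching.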

Paper Structure

This paper contains 33 sections, 7 equations, 6 figures, and 12 tables.

Introduction
Method
Notation
Reinforcement World Model Learning
Experiments
Experiment Setup
Benchmarks
Baselines
Models and Training Data
Main Results
RWML Forgets Less
Ablation Studies
Discussion
Impact of RWML on Decision-Making
Weight Change Analysis
...and 18 more sections

Figures (6)

Figure 1: We propose RWML as a scalable, self-supervised method to improve the world-modeling ability of LLM-based agents by learning from next states, prior to downstream policy RL, which learns from task-success rewards.
Figure 2: Overview of RWML. Given a target model $\pi_\theta$, we first collect training data for RWML by using $\pi_\theta$ to gather rollouts $(s_0, a_0, s_1, a_1, \ldots, s_T)$ with the environment, and then convert these rollouts into $\left\langle s_{\le t}, a_t, s_{t+1} \right\rangle$ triplets for all $t$, after subsampling "too easy" samples as defined in \ref{eq:too_easy_eq}. We then train $\pi_\theta$ to reason as a world model via GRPO, using lightweight reward functions (e.g., embedding-based cosine similarity) to compare the predicted $\hat{s}_{t+1}$ with the real $s_{t+1}$.
Figure 3: Comparing parameter change ratios per layer across models trained with different algorithms. We find that WM SFT-trained models show significantly more parameter change compared to RWML and Policy RL, potentially contributing to the model forgetting discussed in \ref{subsec:Forgetting}.
Figure 4: RWML training with different base models on $\tau^2$ Bench.
Figure 5: After RWML, models produce more accurate and efficient decisions by leveraging their improved knowledge of the environment.
...and 1 more figures
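
The data-construction step in the Figure 2 caption can be sketched as follows, assuming a rollout is stored as a list of textual states and a list of actions. The function name `rollout_to_triplets` is hypothetical, and the history $s_{\le t}$ is represented here as a tuple of states only, following the caption's notation. The "too easy" subsampling step is left out because its defining equation is not reproduced in this section.

```python
def rollout_to_triplets(states, actions):
    """Convert one rollout (s_0, a_0, s_1, ..., s_T) into a list of
    (s_<=t, a_t, s_{t+1}) triplets, one per time step t.

    states:  [s_0, s_1, ..., s_T]   (T + 1 textual states)
    actions: [a_0, a_1, ..., a_{T-1}]   (T actions)
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected exactly one more state than actions")
    triplets = []
    for t, action in enumerate(actions):
        history = tuple(states[: t + 1])  # s_<=t: all states up to and including s_t
        triplets.append((history, action, states[t + 1]))
    return triplets
```

Each triplet supplies the world-model input (history and action) together with the real next state that the predicted $\hat{s}_{t+1}$ is compared against.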
