Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction

Bao Shu, Yan Cai, Jianjian Sun, Chunrui Han, En Yu, Liang Zhao, Jingcheng Hu, Yinmin Zhang, Haoran Lv, Yuang Peng, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Xiangyu Yue

TL;DR

WMAct tackles the challenge of enabling robust world-model reasoning in LLM agents without constraining cognitive flexibility. It introduces thinking-by-doing via multi-turn interaction and two core mechanisms—reward rescaling and interaction-frequency annealing—to internalize environmental dynamics. Empirical results across Sokoban, Maze, and Taxi show that WMAct achieves strong single-turn competence, outperforms rigid and interactive baselines, and transfers to diverse reasoning benchmarks. The work offers a practical path toward efficient, long-horizon planning in embodied LLM systems and highlights the importance of adaptive interaction strategies for internalizing world models.

Abstract

Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi-turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process, which constrains the model's active learning, ultimately hindering efficient world model reasoning. To address these issues, we explore world-model internalization through efficient interaction and active reasoning (WMAct), which liberates the model from structured reasoning, allowing the model to shape thinking directly through its doing, and achieves effective and efficient world model reasoning with two key mechanisms: (1) a reward rescaling mechanism adjusting outcome reward based on action efficacy to incentivize redundancy reduction and purposeful interaction; (2) an interaction frequency annealing strategy to progressively reduce the maximum allowed interaction turns, which compels the model to condense its learning and internalize environmental dynamics rather than over-relying on environmental cues. Our experiments on Sokoban, Maze, and Taxi show that WMAct yields effective world model reasoning capable of resolving tasks in a single turn that previously required multiple interactions and fosters strong transferability to complex environments, improving performance on a suite of reasoning benchmarks.
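The two mechanisms named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' formulation: the abstract gives neither the rescaling formula nor the annealing schedule. The sketch assumes the outcome reward is scaled by the fraction of actions that actually changed the environment state, and that the turn cap decreases linearly over training. The names `rescaled_reward`, `max_turns_at`, `effective_actions`, `start_turns`, and `end_turns` are hypothetical.

```python
import math


def rescaled_reward(outcome_reward: float, effective_actions: int, total_actions: int) -> float:
    """Scale the outcome reward by action efficacy.

    ASSUMPTION: efficacy is the fraction of actions that changed the
    environment state; the paper's exact definition may differ.
    """
    if total_actions <= 0:
        return 0.0  # no actions taken, so nothing to reward
    if not 0 <= effective_actions <= total_actions:
        raise ValueError("effective_actions must lie in [0, total_actions]")
    return outcome_reward * effective_actions / total_actions


def max_turns_at(step: int, total_steps: int, start_turns: int, end_turns: int = 1) -> int:
    """Return the maximum allowed interaction turns at a training step.

    ASSUMPTION: a linear schedule from start_turns down to end_turns;
    the paper's actual annealing schedule is not given in the abstract.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    reduction = math.floor((start_turns - end_turns) * frac)
    return max(end_turns, start_turns - reduction)


if __name__ == "__main__":
    # A successful episode in which 3 of 4 actions moved the agent or a box.
    print(rescaled_reward(1.0, effective_actions=3, total_actions=4))
    # Turn cap at the start, midpoint, and end of a 100-step run.
    print([max_turns_at(s, 100, start_turns=8) for s in (0, 50, 100)])
```

Under these assumptions, redundant actions (for example, walking into a wall) lower the reward of an otherwise successful episode, and the shrinking turn cap pushes the policy toward solving the task with fewer environment queries, which matches the intent described in the abstract.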

Paper Structure

This paper contains 21 sections, 5 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Illustration of the difference between monolithic reasoning (Thinking Only) and the multi-turn interaction approach (Thinking by Doing). Top: the agent relies on monolithic reasoning and internal simulation to plan a path. Without interaction, this strategy imposes a substantial cognitive burden and risks reinforcing erroneous internal knowledge, ultimately leading to failure. Bottom: multi-turn interaction avoids the pitfalls of internal simulation, allowing continuous path validation and correction and resulting in successful completion.
  • Figure 1: Environment Visualization. For each environment, the left panel shows the standardized character map for agent observation, and the right panel presents the visual illustration for human intuition and inspection.
  • Figure 2: Illustration of the evolution of the model's behavior in the Maze example. As training progresses, the model's single-turn accuracy continually improves, ultimately matching multi-turn accuracy. Tasks that previously required multi-turn interactive trial and error can now be solved within a single turn. This progression illustrates a transition from reactive, multi-turn planning to proactive, single-turn planning. The model first relies on step-by-step interaction but later strengthens its long-range planning, allowing it to internalize the complex interactive strategy and thereby improve efficiency and computational economy without sacrificing accuracy.
  • Figure 2: Reasoning Trace Visualization. A qualitative comparison of PPO-Entire and WMAct in a Maze task. Top (PPO-Entire): a more rigid, often enumeration-based or pre-computed planning approach, which frequently leads to suboptimal trajectories or impasses due to insufficient intermediate reflection and a lack of dynamic adaptation. Bottom (WMAct): the emergent interactive thinking and planning patterns of our model. At each step, WMAct analyzes the current state, reflects on recent moves, and dynamically adjusts its strategy, resulting in more robust self-correction and efficient goal achievement.
  • Figure 3: Comparison between different interactive strategies.
  • ...and 4 more figures