Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu, Janardhan Kulkarni, Suman Nath, Zhou Yu, Jianfeng Gao

TL;DR

Dyna-Mind introduces a two-stage training framework that teaches (V)LM agents to integrate environment simulations into reasoning for long-horizon, interactive tasks. Stage 1 (ReSim) uses expanded search trees from real interactions to generate simulation-guided reasoning traces and trains a policy via imitation. Stage 2 (Dyna-GRPO) applies online RL, leveraging SimRollout and future-state textual signals to refine both policy and simulation ability, yielding superior planning performance on Sokoban, ALFWorld, and AndroidWorld. Across synthetic and realistic benchmarks, the results demonstrate that robust simulation capacity correlates with improved reasoning and that combining outcome rewards with intermediate state signals enhances policy quality and generalization.

Abstract

Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone use. Inspired by literature on human cognition, we argue that current AI agents need "vicarious trial and error", the capacity to mentally simulate alternative futures before acting, in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method that further strengthens the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in increasingly challenging environments.

Paper Structure

This paper contains 41 sections, 4 equations, 6 figures, 11 tables, 3 algorithms.

Figures (6)

  • Figure 1: We find that the performance of strong reasoning models is heavily affected by their ability to simulate in different environments (left). We introduce Dyna-Mind, a two-stage training framework to integrate and improve the simulation ability of AI agents (right).
  • Figure 2: ReSim integrates simulation into reasoning ($a_t^{\mathrm{ReSim}}$) by using expanded search trees built through real environment interactions (left). ReSim then trains an agent to directly generate such a simulation-guided reasoning trace $a_t^{\mathrm{ReSim}}$ without any algorithm support (right).
  • Figure 3: Dyna-GRPO iterates between policy improvement (left) and world model improvement (right), optimized by GRPO. During policy improvement, we perform grouped policy rollouts with GRPO. During simulation improvement, we perform both policy rollouts and simulation refinement rollouts (see Figure 4), and train the model to directly generate an improved policy as well as to better perform simulation refinement when provided with future-state information.
  • Figure 4: SimRollout generates a refined action per state $s_t$ using real environment interactions.
  • Figure : Dyna-GRPO
  • ...and 1 more figure
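
The Figure 3 caption mentions grouped policy rollouts optimized by GRPO. As a rough illustration of that grouping idea only, the sketch below computes the standard GRPO-style group-relative advantage: each rollout's outcome reward is standardized against the other rollouts sampled for the same task. It is not the paper's Dyna-GRPO objective, which additionally uses intermediate states as feedback. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group.

    rewards: outcome rewards for a group of rollouts of the same task.
    eps: hypothetical stabilizer so an all-equal group does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts: two succeed (1.0), two fail (0.0).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)

# A group where every rollout gets the same reward carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch, successful rollouts receive positive advantages and failed ones negative, while a group with identical rewards yields all-zero advantages, which is why outcome-only grouping can be uninformative on tasks the policy always or never solves.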