RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Zhonghan Zhao, Wenwei Zhang, Haian Huang, Kuikun Liu, Jianfei Gao, Gaoang Wang, Kai Chen

TL;DR

RIG integrates reasoning and imagination in embodied agents by unifying textual reasoning, visual imagination, and low-level control within a single Transformer. It introduces a progressive data collection pipeline and two models: RIG-basic (reasoning before action) and RIG-lookahead (reasoning plus review of imagined futures), improving sample efficiency by more than $17\times$ over prior end-to-end baselines. In the MineRL Minecraft environment, RIG attains state-of-the-art performance on embodied tasks, image generation, and reasoning benchmarks, with gains of $3.29\times$, $2.42\times$, and $1.33\times$, respectively, using roughly $111$ hours of data. The results demonstrate robustness, generalization, and test-time scalability via lookahead, suggesting practical impact for scalable generalist policies in open-world settings.

Abstract

Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, which limits the learning efficiency and generalization of the policy. This paper therefore makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG end to end, we construct a data pipeline that progressively integrates and enriches the imagination and reasoning content in trajectories collected from existing agents. Jointly learning reasoning and next-image generation explicitly models the inherent correlation between reasoning, action, and environment dynamics, and thus yields a more than $17\times$ improvement in sample efficiency, along with better generalization, compared with previous works. During inference, RIG first reasons about the next action, proposes a candidate action, and then predicts the outcome of that action, which gives the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of the generalist policy but also enables test-time scaling to enhance overall performance.
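The inference procedure described in the abstract (reason, propose an action, imagine its outcome, then review before acting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `lookahead_step`, `Decision`, and the three `toy_*` callables are hypothetical, strings stand in for images, and the revision limit `max_revisions` is an assumption. In RIG itself a single Transformer plays all three roles; they are separate functions here only to make the control flow visible.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Decision:
    reasoning: str
    action: str
    imagined_outcome: str  # stand-in for a generated next frame
    revisions: int         # how many times the first proposal was revised


def lookahead_step(
    observation: str,
    reason_and_act: Callable[[str, str], Tuple[str, str]],
    imagine: Callable[[str, str], str],
    review: Callable[[str], Tuple[bool, str]],
    max_revisions: int = 2,  # assumed budget, not taken from the paper
) -> Decision:
    """One decision step: reason -> propose action -> imagine -> review."""
    feedback = ""
    for attempt in range(max_revisions + 1):
        reasoning, action = reason_and_act(observation, feedback)
        imagined = imagine(observation, action)
        accepted, feedback = review(imagined)
        if accepted:
            break  # the imagined outcome looks fine, so commit to this action
    return Decision(reasoning, action, imagined, attempt)


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_reason_and_act(observation: str, feedback: str) -> Tuple[str, str]:
    if feedback:
        return ("The imagined outcome was bad, so turn away.", "turn_left")
    return ("The goal is straight ahead.", "forward")


def toy_imagine(observation: str, action: str) -> str:
    if observation == "lava_ahead" and action == "forward":
        return "agent_in_lava"
    return "agent_safe"


def toy_review(imagined: str) -> Tuple[bool, str]:
    if imagined == "agent_in_lava":
        return (False, "imagined outcome is unsafe")
    return (True, "")


decision = lookahead_step("lava_ahead", toy_reason_and_act, toy_imagine, toy_review)
print(decision.action, decision.revisions)  # turn_left 1
```

Raising `max_revisions` spends more computation per step on imagined rollouts before any real action is taken, which is one simple way to picture the test-time scaling the abstract mentions.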

Paper Structure

This paper contains 26 sections, 4 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Comparison between conventional agents and RIG. RIG produces reasoning, actions, and imagination within a single Transformer.
  • Figure 2: Illustration of the data collection pipeline (S0–S4). Note that at S3 (Vision-Reviewing), we run the trained RIG-basic and the policy model STEVE-1 (lifshitz2023steve) in parallel, keeping the instances where RIG-basic performs worse than STEVE-1.
  • Figure 3: Inference process in RIG. RIG follows a structured conversation flow through multi-turn interactions. It consistently uses the fixed word "Imagine:" to separate internally imagined scenarios from real observations, thereby guiding coherent reasoning, action prediction, and visual imagination.
  • Figure 4: Performance and data-efficiency comparison. RIG-basic significantly outperforms the other baselines with higher sample efficiency, using only 111 hours of training data (42 h of S0 MineRL-V0 plus 69 h from S1–S4). MineDreamer (zhou2024minedreamer), a hybrid-system model, trains a separate visual generation model (139 hours) and also relies on VPT for its policy model, which increases its total data requirement. The duration shown for VPT (openai2022vpt) reflects only the IDM data used, measured as video frames; STEVE-1 (lifshitz2023steve) and Jarvis-1 (wang2023jarvis) also leverage the VPT dataset.
  • Figure 5: Comparison with various baselines across embodied tasks, generation, understanding, and reasoning. RIG-basic incorporates reasoning without reviewing, while RIG-lookahead integrates both reasoning and reviewing capabilities.
  • ...and 6 more figures
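
The S3 (Vision-Reviewing) selection rule from the Figure 2 caption, which keeps instances where RIG-basic does worse than STEVE-1, can be sketched as a simple filter. The caption does not say how performance is compared, so the scalar scores, the `Rollout` record, the `select_for_review` helper, and the `margin` parameter below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    task: str
    rig_basic_score: float  # hypothetical scalar outcome for RIG-basic
    steve1_score: float     # hypothetical scalar outcome for STEVE-1


def select_for_review(rollouts: List[Rollout], margin: float = 0.0) -> List[Rollout]:
    """Keep rollouts where RIG-basic trails STEVE-1 by more than `margin`."""
    return [r for r in rollouts if r.steve1_score - r.rig_basic_score > margin]


# Made-up scores, only to exercise the filter.
rollouts = [
    Rollout("collect_wood", rig_basic_score=0.9, steve1_score=0.7),
    Rollout("collect_dirt", rig_basic_score=0.4, steve1_score=0.8),
    Rollout("collect_seeds", rig_basic_score=0.5, steve1_score=0.5),
]
kept = select_for_review(rollouts)
print([r.task for r in kept])  # ['collect_dirt']
```

Only the case where RIG-basic falls behind is kept, so the retained instances concentrate on situations the current model handles poorly.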