Optimus-3: Dual-Router Aligned Mixture-of-Experts Agent with Dual-Granularity Reasoning-Aware Policy Optimization

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Yaowei Wang, Liqiang Nie

TL;DR

Optimus-3 tackles open-world embodied AI in Minecraft by unifying fast reflex actions (System 1) with deliberative reasoning (System 2) in a single end-to-end framework. It introduces a knowledge-enhanced data generation pipeline (OptimusM$^4$), a Dual-Router Aligned MoE architecture for horizontal task decoupling and vertical adaptive depth, and Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) to supervise thinking and answering with dense rewards. The approach yields substantial gains across Planning, Captioning, Embodied QA, Grounding, and Reflection, plus strong improvements in long-horizon actions and open-ended task success (average 60% on open-ended tasks). The work also highlights the importance of domain knowledge, task-aware routing, and fine-grained reasoning supervision for robust multi-modal, open-ended intelligence in dynamic environments, and releases the OptimusM$^4$ dataset to the community.

Abstract

Developing generalist agents capable of solving open-ended tasks in visually rich, dynamic environments remains a core pursuit of embodied AI. While Minecraft has emerged as a compelling benchmark, existing agents often suffer from fragmented cognitive abilities, lacking the synergy between reflexive execution (System 1) and deliberative reasoning (System 2). In this paper, we introduce Optimus-3, a generalist agent that organically integrates these dual capabilities within a unified framework. To achieve this, we address three fundamental challenges. First, to overcome the scarcity of reasoning data, we propose a Knowledge-Enhanced Automated Data Generation Pipeline. It synthesizes high-quality System 2 reasoning traces from raw System 1 interaction trajectories, effectively mitigating hallucinations via injection of domain knowledge. We release the resulting dataset, OptimusM$^4$, to the community. Second, to reconcile the dichotomous computational requirements of the dual systems, we design a Dual-Router Aligned MoE Architecture. It employs a Task Router to prevent task interference via parameter decoupling, and a Layer Router to dynamically modulate reasoning depth, creating a computational "Fast Path" for System 1 and a "Deep Path" for System 2. Third, to activate the reasoning capabilities of System 2, we propose the Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) algorithm. It enforces Process-Outcome Co-Supervision via dual-granularity dense rewards, ensuring consistency between the thought process and the answer. Extensive evaluations demonstrate that Optimus-3 surpasses existing state-of-the-art methods on both System 2 (21% on Planning, 66% on Captioning, 76% on Embodied QA, 3.4× on Grounding, and 18% on Reflection) and System 1 (3% on Long-Horizon Action) tasks, with a notable 60% success rate on open-ended tasks.
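
The routing scheme summarized in the abstract, and detailed in the Figure 2 caption (a Task Router that pairs each input with its task expert plus a shared knowledge expert, and a Layer Router that skips intermediate layers for latency-sensitive actions, with both decisions made once before the forward pass), can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the names `route`, `ToyMoELayer`, `forward`, and `fast_stride` are hypothetical, the routers are hard-coded rules rather than learned modules, and each expert is a scalar gain rather than a feed-forward network.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Task names follow the paper's task list. The System 1 / System 2 split
# follows the abstract: Action is System 1, the remaining tasks are System 2.
SYSTEM1_TASKS = {"action"}
SYSTEM2_TASKS = {"planning", "captioning", "embodied_qa", "grounding", "reflection"}


@dataclass(frozen=True)
class RoutingDecision:
    """Both routing choices, fixed once before the forward pass."""
    task: str                       # which task expert to use (Task Router)
    active_layers: Tuple[int, ...]  # which layers to execute (Layer Router)


def route(task: str, num_layers: int, fast_stride: int = 2) -> RoutingDecision:
    """Hypothetical rule-based stand-in for the two routers."""
    if task not in SYSTEM1_TASKS | SYSTEM2_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if fast_stride < 1:
        raise ValueError("fast_stride must be >= 1")
    if task in SYSTEM1_TASKS:
        # "Fast Path" (assumed rule): keep every fast_stride-th layer
        # and always keep the final layer.
        active = tuple(
            i for i in range(num_layers)
            if i % fast_stride == 0 or i == num_layers - 1
        )
    else:
        # "Deep Path": run the full stack.
        active = tuple(range(num_layers))
    return RoutingDecision(task, active)


class ToyMoELayer:
    """Residual layer combining a shared expert with one expert per task.

    Experts are scalar gains here purely to keep the sketch short.
    """

    def __init__(self, shared_scale: float, task_scales: Dict[str, float]) -> None:
        self.shared_scale = shared_scale
        self.task_scales = task_scales

    def __call__(self, x: Sequence[float], task: str) -> List[float]:
        # residual + shared knowledge expert + selected task expert
        gain = 1.0 + self.shared_scale + self.task_scales[task]
        return [gain * v for v in x]


def forward(x: Sequence[float], layers: Sequence[ToyMoELayer],
            decision: RoutingDecision) -> List[float]:
    """Run only the layers selected by the precomputed routing decision."""
    h = list(x)
    for i in decision.active_layers:
        h = layers[i](h, decision.task)
    return h


if __name__ == "__main__":
    scales = {t: 0.5 for t in SYSTEM1_TASKS | SYSTEM2_TASKS}
    layers = [ToyMoELayer(0.5, scales) for _ in range(6)]
    for task in ("action", "planning"):
        decision = route(task, len(layers))
        print(task, decision.active_layers, forward([1.0], layers, decision))
```

With six layers and the assumed stride of 2, an `action` input executes four layers while a `planning` input executes all six, which is the latency asymmetry the Fast Path and Deep Path are meant to provide. Because the decision is computed before the forward pass, the set of layers to run is known up front.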

Paper Structure

This paper contains 19 sections, 11 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Given the task Craft a diamond sword based on the current inventory, Optimus-3 employs Captioning to perceive and interpret the inventory information, Grounding to select appropriate tools, Planning to generate sub-goals based on available materials, Action to execute these sub-goals sequentially, Reflection to assess the current task state, and Embodied QA to verify whether the task has been successfully completed.
  • Figure 2: (A): Overview of Optimus-3. Given observations and instructions, Optimus-3 couples System-1 fast reaction (Action) and System-2 deliberate reasoning (Embodied QA, Planning, Grounding, Reflection) within the Dual-Router Aligned MoE architecture. (B): Details of the Dual-Router Aligned MoE architecture. Horizontally, the Task Router assigns each input to its corresponding task expert together with a shared knowledge expert. Vertically, the Layer Router reduces latency for time-sensitive action inference by selectively skipping intermediate layers. Both routing decisions are made once before the forward pass. (C): Performance comparison of Optimus-3 against current task-specific SOTA agents, GPT-4o [gpt4], and Qwen2.5-VL [bai2025qwen2].
  • Figure 3: Different agent frameworks in Minecraft. (A) A goal-conditioned policy based on the Transformer-XL architecture. (B) Function calling, which employs an LLM to generate executable functions. (C) An (M)LLM as the high-level planner, which then employs a goal-conditioned policy to generate low-level actions. (D) An MLLM generates latent tokens that serve as conditioning inputs for the policy. (E) The end-to-end MoE architecture (Ours), which is endowed with multi-dimensional capabilities.
  • Figure 4: Knowledge-Enhanced Data Generation Pipeline. The knowledge source is in green. Given a task pool, we utilize a knowledge graph [li2024optimus] to generate task plans, forming the planning dataset. These plans are then used as instructions for STEVE-1 [lifshitz2024steve], which interacts with the environment to produce the action dataset. During this process, we randomly sample images and employ expert models [lu2024deepseek, liu2024grounding] with environmental feedback to generate the captioning, embodied QA, and grounding datasets.
  • Figure 5: Data visualization and statistics of the OptimusM$^4$ dataset. Top: Representative samples across the Planning, Captioning, Embodied QA, Grounding, Reflection, and Action tasks. Bottom: Statistical overview, detailing sample counts, biome distribution, and tech-level distribution for Action data.
  • ...and 7 more figures