Optimus-3: Dual-Router Aligned Mixture-of-Experts Agent with Dual-Granularity Reasoning-Aware Policy Optimization

Zaijing Li; Yuquan Xie; Rui Shao; Gongwei Chen; Weili Guan; Dongmei Jiang; Yaowei Wang; Liqiang Nie

TL;DR

Optimus-3 tackles open-world embodied AI in Minecraft by unifying fast reflex actions (System 1) with deliberative reasoning (System 2) in a single end-to-end framework. It introduces a knowledge-enhanced data generation pipeline that produces the OptimusM$^4$ dataset, a Dual-Router Aligned MoE architecture for horizontal task decoupling and vertical adaptive depth, and Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO), which supervises both thinking and answering with dense rewards. The approach yields substantial gains across Planning, Captioning, Embodied QA, Grounding, and Reflection, along with improvements in long-horizon actions and a 60% average success rate on open-ended tasks. The work also highlights the importance of domain knowledge, task-aware routing, and fine-grained reasoning supervision for robust multi-modal, open-ended intelligence in dynamic environments, and releases the OptimusM$^4$ dataset to the community.

Abstract

Developing generalist agents capable of solving open-ended tasks in visually rich, dynamic environments remains a core pursuit of embodied AI. While Minecraft has emerged as a compelling benchmark, existing agents often suffer from fragmented cognitive abilities, lacking the synergy between reflexive execution (System 1) and deliberative reasoning (System 2). In this paper, we introduce Optimus-3, a generalist agent that organically integrates these dual capabilities within a unified framework. To achieve this, we address three fundamental challenges. First, to overcome the scarcity of reasoning data, we propose a Knowledge-Enhanced Automated Data Generation Pipeline. It synthesizes high-quality System 2 reasoning traces from raw System 1 interaction trajectories and mitigates hallucinations by injecting domain knowledge. We release the resulting dataset, OptimusM$^4$, to the community. Second, to reconcile the dichotomous computational requirements of the dual systems, we design a Dual-Router Aligned MoE Architecture. It employs a Task Router to prevent task interference via parameter decoupling, and a Layer Router to dynamically modulate reasoning depth, creating a computational "Fast Path" for System 1 and a "Deep Path" for System 2. Third, to activate the reasoning capabilities of System 2, we propose the Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) algorithm. It enforces Process-Outcome Co-Supervision via dual-granularity dense rewards, ensuring consistency between the thought process and the answer. Extensive evaluations demonstrate that Optimus-3 surpasses existing state-of-the-art methods on both System 2 (21% on Planning, 66% on Captioning, 76% on Embodied QA, 3.4$\times$ on Grounding, and 18% on Reflection) and System 1 (3% on Long-Horizon Action) tasks, with a notable 60% success rate on open-ended tasks.
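
To make the dual-router idea concrete, the following is a minimal toy sketch of how a task router and a layer router could interact. It is not the Optimus-3 implementation. The class `DualRouterStack`, the fixed per-task depth table, the scalar "hidden state", and the helper `make_stack` are all hypothetical simplifications introduced here for illustration; in the paper both routers are presumably learned and operate on model activations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A "layer" here is just a function on a scalar hidden state (toy assumption).
Layer = Callable[[float], float]


@dataclass
class DualRouterStack:
    # Hypothetical: one separate list of layers per task, standing in for
    # task-specific expert parameters (parameter decoupling).
    experts: Dict[str, List[Layer]]
    # Hypothetical: how many layers stay active per task, standing in for
    # the layer router's depth decision.
    depth: Dict[str, int]

    def task_route(self, task: str) -> List[Layer]:
        """Task router: pick the parameter set dedicated to this task."""
        return self.experts[task]

    def layer_route(self, task: str, num_layers: int) -> List[int]:
        """Layer router: choose which layer indices to execute."""
        return list(range(min(self.depth[task], num_layers)))

    def forward(self, task: str, hidden: float) -> Tuple[float, int]:
        layers = self.task_route(task)
        active = self.layer_route(task, len(layers))
        for index in active:
            hidden = layers[index](hidden)
        # Return the output and the number of layers actually executed.
        return hidden, len(active)


def make_stack() -> DualRouterStack:
    add_one: Layer = lambda h: h + 1.0
    double: Layer = lambda h: h * 2.0
    return DualRouterStack(
        experts={"action": [add_one] * 4, "planning": [double] * 4},
        # Shallow "Fast Path" for System 1, full "Deep Path" for System 2.
        depth={"action": 1, "planning": 4},
    )


if __name__ == "__main__":
    stack = make_stack()
    print(stack.forward("action", 1.0))
    print(stack.forward("planning", 1.0))
```

In this sketch the action task runs a single layer while the planning task runs all four, which mirrors the "Fast Path" and "Deep Path" distinction only at the level of control flow.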

Paper Structure

Table of Contents

Figures (12)