Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

Zaijing Li; Yuquan Xie; Rui Shao; Gongwei Chen; Dongmei Jiang; Liqiang Nie

TL;DR

Optimus-2 addresses a core challenge in learning human-like behavior for open-world tasks: jointly modeling observations, actions, and language. It combines an MLLM-based task planner with a Goal-Observation-Action Conditioned Policy (GOAP), which uses an Action-guided Behavior Encoder and a memory-augmented, long-horizon-aware architecture to predict actions from open-ended goals. The authors introduce the MGOA dataset, a large-scale, automatically constructed collection of goal-observation-action pairs, to train GOAP across atomic, long-horizon, and instruction-following tasks. Experimental results show that Optimus-2 outperforms prior state-of-the-art planners and policies across all task categories, supported by ablations and visualizations that highlight the importance of causal observation-action modeling and language grounding. The work advances open-world agent capabilities in Minecraft and contributes a scalable data-generation pipeline for future research.

Abstract

Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft. Please see the project page at https://cybertronagent.github.io/Optimus-2.github.io/.
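To make the phrase "consolidating it into fixed-length behavior tokens" concrete, here is a minimal sketch of one generic way to compress a variable-length observation-action history into a fixed number of vectors: cross-attention pooling with a fixed set of query vectors. This is an illustration only. The abstract does not state that GOAP uses this exact mechanism, and the names `fuse_step`, `consolidate`, `queries`, `DIM`, and `K`, as well as the random features, are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_step(obs_feat, act_feat):
    """Hypothetical per-timestep fusion: elementwise sum of the two features.

    A stand-in for whatever per-step observation-action modeling is used.
    """
    return [o + a for o, a in zip(obs_feat, act_feat)]


def consolidate(history, queries):
    """Cross-attention pooling (assumed mechanism, not the paper's).

    history: list of T step vectors, for any T >= 1
    queries: list of K query vectors (stand-ins for learned parameters)
    returns: K vectors, independent of T
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    tokens = []
    for q in queries:
        # Scaled dot-product score of this query against every history step.
        scores = [sum(qi * hi for qi, hi in zip(q, h)) / scale for h in history]
        weights = softmax(scores)
        # Each token is a weighted average of the history steps.
        token = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
        tokens.append(token)
    return tokens


rng = random.Random(0)
DIM, K = 8, 4  # hypothetical feature size and number of behavior tokens
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]


def random_history(num_steps):
    """Random stand-in features for num_steps observation-action steps."""
    return [
        fuse_step(
            [rng.gauss(0, 1) for _ in range(DIM)],
            [rng.gauss(0, 1) for _ in range(DIM)],
        )
        for _ in range(num_steps)
    ]


short_tokens = consolidate(random_history(5), queries)
long_tokens = consolidate(random_history(200), queries)
```

In this sketch both `short_tokens` and `long_tokens` contain `K` vectors of length `DIM`, even though one history has 5 steps and the other 200. A constant-size summary of this kind is what would let a downstream language model condition on long histories at a fixed input cost.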
