Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie

TL;DR

Optimus-2 addresses the core challenge of learning human-like behavior in open-world tasks by jointly modeling observations, actions, and language. It combines an MLLM-based task planner with a GOAP policy that uses an Action-guided Behavior Encoder and a memory-augmented, long-horizon-aware architecture to predict actions from open-ended goals. The authors introduce the MGOA dataset, a large-scale, automatically constructed collection of goal-observation-action triples, to train GOAP and enable robust learning across atomic, long-horizon, and instruction-following tasks. Experimental results show that Optimus-2 outperforms prior state-of-the-art planners and policies across all task categories, supported by ablations and visualizations that highlight the importance of causal observation-action modeling and language grounding. The work advances open-world agent capabilities in Minecraft and contributes a scalable data-generation pipeline for future research.

Abstract

Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft. Please see the project page at https://cybertronagent.github.io/Optimus-2.github.io/.
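To make the idea of "fixed-length behavior tokens" concrete, here is a minimal sketch of one way a variable-length observation-action history could be pooled into a constant number of tokens. It is an illustration under stated assumptions, not the paper's encoder: the element-wise fusion in `fuse_step`, the attention pooling in `consolidate`, the random `queries`, and the sizes `DIM` and `NUM_TOKENS` are all hypothetical choices made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_step(obs_feat, act_feat):
    # Hypothetical per-timestep fusion: element-wise sum of the
    # observation feature and the action feature.
    return [o + a for o, a in zip(obs_feat, act_feat)]


def consolidate(history, queries):
    """Attention-pool a history of any length into len(queries) tokens."""
    scale = math.sqrt(len(queries[0]))
    tokens = []
    for q in queries:
        weights = softmax([dot(q, h) / scale for h in history])
        token = [
            sum(w * h[d] for w, h in zip(weights, history))
            for d in range(len(q))
        ]
        tokens.append(token)
    return tokens


random.seed(0)
DIM, NUM_TOKENS = 4, 2  # assumed sizes, chosen only for illustration
# Stand-ins for learned query vectors (random here, not trained).
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]


def random_feature():
    return [random.gauss(0, 1) for _ in range(DIM)]


def behavior_tokens(num_steps):
    """Build a random history of num_steps fused steps and pool it."""
    history = [fuse_step(random_feature(), random_feature())
               for _ in range(num_steps)]
    return consolidate(history, queries)


short_tokens = behavior_tokens(3)
long_tokens = behavior_tokens(50)
print(len(short_tokens), len(long_tokens))
```

Both a 3-step and a 50-step history yield the same number of tokens, which is the property that lets a downstream language model consume long histories at a constant input cost.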

Paper Structure

This paper contains 31 sections, 8 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: Left: General agent framework. Right: Comparison between existing goal-conditioned policies and ours. Existing Transformer-XL-based policies [cai2023groot, lifshitz2024steve] exhibit limited natural language understanding capabilities and rely solely on combining implicit goal embeddings with visual embeddings as inputs. In contrast, our GOAP achieves superior action prediction by 1) employing an Action-guided Behavior Encoder to strengthen causal modeling between observations and actions, as well as to improve historical sequence modeling capabilities, and 2) leveraging an MLLM to enhance open-ended language comprehension.
  • Figure 2: Overview of Optimus-2. Given a task and the current observation, Optimus-2 first uses an MLLM-based Planner to generate a series of sub-goals. Optimus-2 then sequentially executes these sub-goals through GOAP. GOAP obtains behavior tokens for the current timestep via the Action-guided Behavior Encoder, and these behavior tokens, along with image and text tokens, are fed into the LLM to predict subsequent actions.
  • Figure 3: An illustration of VPT (text) [vpt], STEVE-1 [lifshitz2024steve], and Optimus-2 executing the open-ended instruction, "I need some iron ores, what should I do?". Existing policies are limited by their instruction comprehension abilities and thus fail to complete the task, whereas GOAP leverages the language understanding capabilities of the MLLM, enabling it to accomplish the task.
  • Figure 4: Ablation of the LLM backbone on Open-ended Instruction Tasks, Golden Shovel, Diamond Pickaxe, and Compass.
  • Figure 5: Ablation study on training data. OCD refers to the OpenAI Contractor Dataset [vpt]. We report the average rewards on each Atomic Task.
  • ...and 9 more figures
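
The planner-then-policy flow described in the Figure 2 caption (generate sub-goals, then execute them one at a time with a goal-conditioned policy) can be sketched as a simple control loop. This is a toy illustration only: `toy_planner`, `toy_policy`, `toy_step`, `toy_done`, the sub-goal strings, and the inventory-style observation are all hypothetical stand-ins, and the real planner and GOAP are learned models rather than hand-written rules.

```python
def run_agent(task, observation, planner, policy, step_env, is_done, max_steps=5):
    """Plan sub-goals once, then execute each one with the low-level policy."""
    log = []
    for goal in planner(task, observation):
        for _ in range(max_steps):
            action = policy(goal, observation)
            observation = step_env(observation, action)
            log.append((goal, action))
            if is_done(goal, observation):
                break
    return log, observation


# --- Hypothetical stand-ins for the learned components and the game ---

def toy_planner(task, observation):
    # A real planner would be an MLLM conditioned on the task and observation.
    return ["collect logs", "craft planks"]


def toy_policy(goal, observation):
    # A real policy would predict actions from goal, observation, and history.
    return "attack" if goal == "collect logs" else "craft"


def toy_step(observation, action):
    # Toy environment: the observation is just an inventory dictionary.
    nxt = dict(observation)
    if action == "attack":
        nxt["logs"] += 1
    elif action == "craft" and nxt["logs"] > 0:
        nxt["logs"] -= 1
        nxt["planks"] += 4
    return nxt


def toy_done(goal, observation):
    if goal == "collect logs":
        return observation["logs"] >= 2
    return observation["planks"] >= 4


log, final_obs = run_agent(
    "make planks", {"logs": 0, "planks": 0},
    toy_planner, toy_policy, toy_step, toy_done,
)
print(log)
print(final_obs)
```

Keeping the planner and the policy behind plain function arguments mirrors the separation between high-level planning and low-level control, so either component can be swapped without touching the loop.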