Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie

TL;DR

Optimus-1 tackles open-world, long-horizon tasks by introducing Hybrid Multimodal Memory, combining a Hierarchical Directed Knowledge Graph (HDKG) for structured world knowledge with an Abstracted Multimodal Experience Pool (AMEP) for summarised multimodal experiences. A Knowledge-Guided Planner and Experience-Driven Reflector work with an Action Controller to plan, execute, and reflect, enabling one-shot planning guided by HDKG and in-context learning via AMEP. The memory is non-parametric and plug-and-play, allowing various Multimodal Large Language Models (MLLMs) to serve as backbones, achieving 2–6× generalization gains and near human-level performance on many Minecraft long-horizon tasks without additional parameter updates. The work demonstrates a scalable path toward general-purpose agents that learn from both knowledge and experience, with a self-evolution capability that expands memory over time and tasks.

Abstract

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address these challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-Guided Planner and Experience-Driven Reflector, contributing to better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.
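
To make the idea of a directed knowledge graph guiding planning more concrete, here is a minimal Python sketch. It is not the authors' implementation: the edge table, the item names, and the helpers `prerequisites` and `plan_for` are hypothetical, and the toy recipes are deliberately simplified. It assumes only what the abstract states, namely that world knowledge is held in an explicit directed graph, and shows how a full sub-goal ordering could be read off such a graph in one pass.

```python
from graphlib import TopologicalSorter

# Toy knowledge graph (illustrative, not taken from the paper):
# each key is a material or tool, each value lists what it helps obtain.
EDGES = {
    "log": ["planks"],
    "planks": ["stick", "crafting_table", "wooden_pickaxe"],
    "stick": ["wooden_pickaxe", "stone_sword"],
    "crafting_table": ["wooden_pickaxe", "stone_sword"],
    "wooden_pickaxe": ["cobblestone"],
    "cobblestone": ["stone_sword"],
}


def prerequisites(edges):
    """Invert material -> product edges into product -> set of materials."""
    preds = {}
    for material, products in edges.items():
        preds.setdefault(material, set())
        for product in products:
            preds.setdefault(product, set()).add(material)
    return preds


def plan_for(goal, edges):
    """Return an ordering of sub-goals in which every prerequisite precedes its product."""
    preds = prerequisites(edges)
    if goal not in preds:
        raise KeyError(f"unknown item: {goal}")
    # Walk backwards from the goal to collect only the relevant subgraph.
    needed = {}
    stack = [goal]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed[node] = preds[node]
        stack.extend(preds[node])
    # TopologicalSorter expects node -> predecessors, which is what `needed` holds.
    return list(TopologicalSorter(needed).static_order())


plan = plan_for("stone_sword", EDGES)
print(plan)
```

The order of independent items (for example `stick` versus `crafting_table`) is not fixed by the graph, so the sketch guarantees only that each item appears after everything it depends on.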

Paper Structure

This paper contains 39 sections, 4 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: An illustration of Optimus-1 performing long-horizon tasks in Minecraft. Given the task "Craft stone sword", the Knowledge-Guided Planner incorporates knowledge from the Hierarchical Directed Knowledge Graph into planning, then the Action Controller executes the planned sequence step by step. During task execution, the Experience-Driven Reflector is periodically activated and retrieves experience from the Abstracted Multimodal Experience Pool to reflect on progress.
  • Figure 2: (a) Extraction process of multimodal experience. The frames are filtered through a video buffer and an image buffer, then MineCLIP [fan2022minedojo] is employed to compute the visual and sub-goal similarities, and finally they are stored in the Abstracted Multimodal Experience Pool. (b) Overview of the Hierarchical Directed Knowledge Graph. Knowledge is stored as a directed graph whose nodes represent objects and whose directed edges point from an object to the items that can be crafted from it.
  • Figure 3: Overview of the Optimus-1 framework. Optimus-1 consists of the Knowledge-Guided Planner, Experience-Driven Reflector, Action Controller, and Hybrid Multimodal Memory architecture. Given the task "craft stone sword", Optimus-1 incorporates knowledge from HDKG into Knowledge-Guided Planning, then the Action Controller generates low-level actions. The Experience-Driven Reflector is periodically activated to introduce multimodal experience from AMEP and determine whether the current task can be executed successfully. If not, it asks the Knowledge-Guided Planner to refine the plan.
  • Figure 4: Illustration of the role of the reflection mechanism. Without a reflection mechanism, STEVE-1 [lifshitz2024steve] often gets into trouble and fails to complete the task. Optimus-1, with the help of the Experience-Driven Reflector, leverages the AMEP to retrieve relevant experience, reflect on the current situation, and correct errors. This improves Optimus-1's success rate on long-horizon tasks.
  • Figure 5: (a) With the help of Hybrid Multimodal Memory, Optimus-1 variants built on various MLLMs show a 2 to 6 times performance improvement. (b) Illustration of the change in Optimus-1's success rate on unseen tasks over 4 epochs.
  • ...and 8 more figures
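
The plan, execute, and reflect cycle described in the captions for Figures 1 and 3 can be sketched as a small control loop. This is a hedged illustration only: `run_task`, the callables `plan_fn`, `act_fn`, and `reflect_fn`, the `reflect_every` interval, and the toy world below are all hypothetical stand-ins for the Knowledge-Guided Planner, Action Controller, and Experience-Driven Reflector, and experience retrieval from AMEP is reduced to a single boolean check.

```python
def run_task(goal, plan_fn, act_fn, reflect_fn, reflect_every=3, max_steps=50):
    """Execute sub-goals in order, reflecting periodically and replanning on trouble.

    plan_fn(goal, stuck_on) -> list of sub-goals (stuck_on is None on the first call)
    act_fn(sub_goal)        -> True if the sub-goal was completed by this step
    reflect_fn(sub_goal)    -> True if the sub-goal still looks achievable
    """
    plan = list(plan_fn(goal, None))
    log = []
    steps = 0
    while plan and steps < max_steps:
        sub_goal = plan[0]
        done = act_fn(sub_goal)
        steps += 1
        log.append(sub_goal)
        if done:
            plan.pop(0)
        # Periodic reflection: if the next sub-goal looks infeasible, refine the plan.
        if plan and steps % reflect_every == 0 and not reflect_fn(plan[0]):
            plan = list(plan_fn(goal, plan[0]))
    return (not plan), log


# Toy world (purely illustrative): stone cannot be mined without a pickaxe.
state = {"has_pickaxe": False}


def toy_plan(goal, stuck_on):
    if stuck_on == "mine_stone":
        return ["craft_pickaxe", "mine_stone", "craft_sword"]
    return ["mine_stone", "craft_sword"]  # deliberately flawed first plan


def toy_act(sub_goal):
    if sub_goal == "craft_pickaxe":
        state["has_pickaxe"] = True
        return True
    if sub_goal == "mine_stone":
        return state["has_pickaxe"]
    return True


def toy_reflect(sub_goal):
    return sub_goal != "mine_stone" or state["has_pickaxe"]


success, log = run_task("craft stone sword", toy_plan, toy_act, toy_reflect, reflect_every=2)
print(success, log)
```

In this toy run the first plan stalls on `mine_stone`, the second step triggers reflection, and the refined plan inserts `craft_pickaxe` before continuing. A real reflector would compare the current observation against retrieved experience instead of reading a flag.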