MrSteve: Instruction-Following Agents in Minecraft with What-Where-When Memory

Junyeong Park, Junmo Cho, Sungjin Ahn

TL;DR

MrSteve introduces Place Event Memory (PEM), a What-Where-When episodic memory system for a low-level Minecraft controller to overcome the short-memory bottleneck of prior agents like Steve-1. Built atop PEM, the Memory-Augmented Task Solving Framework enables switching between exploration and goal-directed execution, guided by a Count-Based high-level strategy and a goal-conditioned VPT-Nav navigator. Across extensive experiments in sparse, long-horizon, and memory-constrained settings, MrSteve yields superior exploration efficiency and faster, more reliable task solving compared to baselines, with strong performance in long-horizon scenarios. The work suggests that incorporating hierarchical, event-aware episodic memory into low-level controllers significantly enhances generalization and efficiency in embodied AI tasks, and it provides code and demos to promote reproducibility and further research.

Abstract

Significant advances have been made in developing general-purpose embodied AI in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. While these approaches, which combine high-level planners with low-level controllers, show promise, low-level controllers frequently become performance bottlenecks due to repeated failures. In this paper, we argue that the primary cause of failure in many low-level controllers is the absence of an episodic memory system. To address this, we introduce MrSteve (Memory Recall Steve), a novel low-level controller equipped with Place Event Memory (PEM), a form of episodic memory that captures what, where, and when information from episodes. This directly addresses the main limitation of the popular low-level controller, Steve-1. Unlike previous models that rely on short-term memory, PEM organizes spatial and event-based data, enabling efficient recall and navigation in long-horizon tasks. Additionally, we propose an Exploration Strategy and a Memory-Augmented Task Solving Framework, allowing agents to alternate between exploration and task-solving based on recalled events. Our approach significantly improves task-solving and exploration efficiency compared to existing methods. We will release our code and demos on the project page: https://sites.google.com/view/mr-steve.

Paper Structure

This paper contains 46 sections, 3 equations, 19 figures, 11 tables, and 8 algorithms.

Figures (19)

  • Figure 1: Sparse Sequential Task Solving Scenario. The first task is to obtain a log. The agent explores to find a tree. While searching, the agent observes a cow but stays focused on acquiring the log. Once the log is obtained, the next task is to obtain a water bucket. Remembering that it already explored the forward direction while searching for the tree, the agent chooses to explore to the right. After obtaining the water bucket, the final task is to obtain meat, which can be acquired from the cow. Recalling the cow’s location, the agent navigates there and completes the task by obtaining the meat. Note that each task takes a few thousand steps to achieve. This scenario highlights the significance of episodic memory for efficient exploration and task-solving in an open-ended world where task-relevant resources are sparsely distributed.
  • Figure 2: MrSteve and Place Event Memory. (a) MrSteve takes the agent's position, first-person view, and text instruction as input, and uses a Memory Module and a Solver Module to follow the instruction. (b) MrSteve leverages Place Event Memory, which stores novel events from visited places, for exploration and task execution.
  • Figure 3: Mode Selector and VPT-Nav in MrSteve. (a) Mode Selector with Place Event Memory. It decides the agent's mode (Explore or Execute) based on whether a task-relevant resource is in the memory, using a hierarchical read operation. (b) Architecture of the Goal-Conditioned VPT Navigator.
  • Figure 4: Agent trajectories of 6K steps on a $100\times 100$ block map with different exploration methods. The leftmost figure is the agent's trajectory from our exploration method.
  • Figure 5: Success Rate and Task Duration of different agents in ABA-Sparse tasks. Task A refers to the first A task in the A-B-A task sequence, while Task A$'$ refers to the final A task in the same sequence. MrSteve and its memory variants outperform Steve-1, which lacks memory. Additionally, while Steve-1 takes a similar amount of time to solve both Task A and Task A$'$, MrSteve solves Task A$'$ much faster. The full results on all 20 tasks are in Appendix \ref{appen:aba_sparse}, and the investigation of memory variants is in Appendix \ref{sec:aba-sprase-memory-limit}.
  • ...and 14 more figures
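
The captions for Figures 2 and 3 describe a memory that stores novel events per visited place and a Mode Selector that picks Explore or Execute depending on whether a task-relevant resource can be recalled. The sketch below is a toy illustration of that idea, not the paper's implementation. All names (`Event`, `PlaceEventMemorySketch`, `select_mode`, `cell_size`) are hypothetical, plain string labels stand in for whatever learned observation representation the paper uses, the grid-cell notion of a "place" is an assumption, and the two-stage `read` is only one possible reading of "hierarchical read".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Position = Tuple[float, float, float]  # (x, y, z) block coordinates
PlaceKey = Tuple[int, int]             # grid cell index on the x-z plane


@dataclass
class Event:
    what: str        # label of the observation (stand-in for a learned embedding)
    where: Position  # agent position when the event was observed
    when: int        # environment step of the observation


@dataclass
class PlaceEventMemorySketch:
    """Toy what-where-when memory: events are grouped by the place they occurred in."""
    cell_size: float = 16.0  # assumed side length of one "place" cell
    places: Dict[PlaceKey, List[Event]] = field(default_factory=dict)

    def _place_key(self, pos: Position) -> PlaceKey:
        x, _, z = pos
        return (int(x // self.cell_size), int(z // self.cell_size))

    def write(self, what: str, where: Position, when: int) -> None:
        events = self.places.setdefault(self._place_key(where), [])
        # Keep only novel events: skip labels already stored for this place.
        if all(e.what != what for e in events):
            events.append(Event(what, where, when))

    def read(self, query: str) -> Optional[Event]:
        # Stage 1: keep the places that hold at least one matching event.
        candidates = [evs for evs in self.places.values()
                      if any(e.what == query for e in evs)]
        # Stage 2: among matching events in those places, return the most recent.
        matches = [e for evs in candidates for e in evs if e.what == query]
        return max(matches, key=lambda e: e.when) if matches else None


def select_mode(memory: PlaceEventMemorySketch,
                resource: str) -> Tuple[str, Optional[Position]]:
    """Return ("execute", goal) if the resource is remembered, else ("explore", None)."""
    hit = memory.read(resource)
    if hit is None:
        return "explore", None
    return "execute", hit.where


memory = PlaceEventMemorySketch()
memory.write("cow", (40.0, 64.0, -8.0), when=120)
memory.write("tree", (90.0, 70.0, 15.0), when=900)

print(select_mode(memory, "cow"))    # remembered: navigate to the stored position
print(select_mode(memory, "water"))  # not remembered: keep exploring
```

In the scenario of Figure 1, the "obtain meat" task would hit the stored cow event and switch to Execute with the cow's position as the navigation goal, while the "obtain a water bucket" task would miss and fall back to Explore.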