MrSteve: Instruction-Following Agents in Minecraft with What-Where-When Memory
Junyeong Park, Junmo Cho, Sungjin Ahn
TL;DR
MrSteve introduces Place Event Memory (PEM), a What-Where-When episodic memory for a low-level Minecraft controller, to overcome the short-memory bottleneck of prior agents such as Steve-1. Built on PEM, the Memory-Augmented Task Solving Framework lets the agent switch between exploration and goal-directed execution, guided by a count-based high-level exploration strategy and a goal-conditioned navigator, VPT-Nav. In experiments covering sparse, long-horizon, and memory-constrained settings, MrSteve explores more efficiently and solves tasks faster and more reliably than the baselines. The work suggests that adding hierarchical, event-aware episodic memory to low-level controllers improves generalization and efficiency in embodied AI tasks; the authors plan to release code and demos on the project page to support reproducibility and further research.
Abstract
Significant advances have been made in developing general-purpose embodied AI in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. While these approaches, which combine high-level planners with low-level controllers, show promise, low-level controllers frequently become performance bottlenecks due to repeated failures. In this paper, we argue that the primary cause of failure in many low-level controllers is the absence of an episodic memory system. To address this, we introduce MrSteve (Memory Recall Steve), a novel low-level controller equipped with Place Event Memory (PEM), a form of episodic memory that captures what, where, and when information from episodes. This directly addresses the main limitation of the popular low-level controller, Steve-1. Unlike previous models that rely on short-term memory, PEM organizes spatial and event-based data, enabling efficient recall and navigation in long-horizon tasks. Additionally, we propose an Exploration Strategy and a Memory-Augmented Task Solving Framework, allowing agents to alternate between exploration and task-solving based on recalled events. Our approach significantly improves task-solving and exploration efficiency compared to existing methods. We will release our code and demos on the project page: https://sites.google.com/view/mr-steve.
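To make the what-where-when idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes observations are already encoded as small feature vectors, that places are cells of a fixed-size grid, and that recall uses cosine similarity with a fixed threshold. All names here (`ToyWhatWhereWhenMemory`, `choose_mode`, `cell_size`, `threshold`) are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    what: list    # feature vector for the observation (assumed encoding)
    where: tuple  # agent position (x, z) when the observation was made
    when: int     # timestep of the observation


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ToyWhatWhereWhenMemory:
    """Toy episodic memory: records are grouped by grid cell (a stand-in for 'place')."""

    def __init__(self, cell_size=10.0):
        self.cell_size = cell_size
        self.places = defaultdict(list)  # cell -> list of Record
        self.visits = defaultdict(int)   # cell -> number of writes in that cell

    def cell_of(self, pos):
        return (math.floor(pos[0] / self.cell_size),
                math.floor(pos[1] / self.cell_size))

    def write(self, what, where, when):
        cell = self.cell_of(where)
        self.places[cell].append(Record(what, where, when))
        self.visits[cell] += 1

    def recall(self, query, threshold=0.8):
        """Return the stored record most similar to the query, or None if below threshold."""
        best, best_score = None, threshold
        for records in self.places.values():
            for record in records:
                score = cosine(query, record.what)
                if score >= best_score:
                    best, best_score = record, score
        return best

    def least_visited(self, candidate_cells):
        """Count-based exploration target: the candidate cell with the fewest visits."""
        return min(candidate_cells, key=lambda c: self.visits.get(c, 0))


def choose_mode(memory, task_query, candidate_cells):
    """Execute if a task-relevant event is recalled; otherwise explore."""
    hit = memory.recall(task_query)
    if hit is not None:
        return ("execute", hit.where)
    return ("explore", memory.least_visited(candidate_cells))


memory = ToyWhatWhereWhenMemory(cell_size=10.0)
memory.write([1.0, 0.0], (3.0, 4.0), when=0)
memory.write([1.0, 0.0], (5.0, 5.0), when=1)
memory.write([0.0, 1.0], (12.0, 4.0), when=2)
cells = [(0, 0), (1, 0), (2, 0)]
print(choose_mode(memory, [0.0, 1.0], cells))   # a matching event is in memory
print(choose_mode(memory, [-1.0, 0.0], cells))  # nothing relevant is in memory
```

In this toy setup, a query that matches a stored event sends the agent to the remembered location, while an unmatched query falls back to the least-visited candidate cell. The real system presumably uses learned embeddings and a richer place and event structure than this grid.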
