MEM: Multi-Scale Embodied Memory for Vision Language Action Models

Marcel Torne; Karl Pertsch; Homer Walke; Kyle Vedder; Suraj Nair; Brian Ichter; Allen Z. Ren; Haohuan Wang; Jiaming Tang; Kyle Stachowicz; Karan Dhabalia; Michael Equi; Quan Vuong; Jost Tobias Springenberg; Sergey Levine; Chelsea Finn; Danny Driess

TL;DR

This work introduces Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies that combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory.

Abstract

Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, such as cleaning up a kitchen or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
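As a rough illustration of the two memory scales described in the abstract, the toy sketch below pairs a text summary (long-horizon) with a bounded window of recent observations (short-horizon). It is not the MEM implementation: the class `MultiScaleMemory`, the `max_frames` parameter, the `toy_update` rule, and the string stand-ins for camera frames are all hypothetical names chosen for this example. In MEM the text memory is updated by a learned high-level policy and the recent frames are compressed by a video encoder.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class MultiScaleMemory:
    """Toy container: a text summary plus a bounded window of recent frames."""

    max_frames: int = 8          # hypothetical short-horizon window length
    language_memory: str = ""    # long-horizon, text-based memory
    frames: Deque[str] = field(default_factory=deque)

    def observe(self, frame: str) -> None:
        """Add a new observation and discard the oldest once the window is full."""
        self.frames.append(frame)
        while len(self.frames) > self.max_frames:
            self.frames.popleft()

    def update_language(self, update_fn: Callable[[str, List[str]], str]) -> None:
        """Rewrite the text memory from its previous value and the recent frames."""
        self.language_memory = update_fn(self.language_memory, list(self.frames))


def toy_update(memory: str, frames: List[str]) -> str:
    """Hypothetical stand-in for a learned high-level policy.

    Appends an event to the text memory when the latest frame is marked 'done:'.
    """
    latest = frames[-1]
    if latest.startswith("done:"):
        event = latest[len("done:"):]
        return f"{memory}; {event}" if memory else event
    return memory


# Strings stand in for camera observations in this toy example.
mem = MultiScaleMemory(max_frames=3)
for frame in ["see pan", "done:butter bread", "see stove", "see cheese", "done:add cheese"]:
    mem.observe(frame)
    mem.update_language(toy_update)
```

After the loop, the frame window holds only the three most recent observations, while the text memory still records the earlier "butter bread" step. This is the division of labour the abstract motivates: dense recent context on one side, compact semantic history on the other.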

Paper Structure (35 sections, 5 equations, 9 figures)

Introduction
Related Work
Multi-Scale Embodied Memory for VLAs
Multi-Scale Embodied Memory (MEM)
Language Memory for Long-Term Memory
Video Encoder for Dense Short-Term Visual Memory
Integrating MEM into the $\pi_{0.6}$ VLA
Experimental Evaluation
MEM Solves Tasks Requiring Long-Horizon Memory
In-Context Adaptation of Manipulation Strategies
Analysis Experiments
Conclusion
Contributions
Task Details
Long-horizon Tasks (\ref{sec:challenge_tasks})
...and 20 more sections

Figures (9)

Figure 0: The MEM memory system equips VLAs like $\pi_{0.6}$ with long-horizon memory via two key components: (1) a high-level policy is trained to keep track of long-horizon semantic events by updating a language memory $m_t$ (left, \ref{sec:language_memory}), and (2) the low-level policy uses a short-horizon observation-based memory that is efficiently encoded via a video encoder (right, \ref{sec:video_encoder}).
Figure 1: Naively passing a sequence of observations into the backbone of a VLA rapidly increases inference latency. Our efficient video encoder architecture allows us to use many observation frames while remaining under critical real-time inference thresholds \cite{black2025real,black2025training}. Timings measured for the $\pi_{0.6}$ VLA, with four input camera streams on one NVIDIA H100 GPU.
Figure 2: We propose an efficient video encoder architecture for compressing short-horizon, image-based memory. Our architecture extends standard ViTs to video inputs by interleaving layers that apply bidirectional spatial attention within each observation (white arrows) with layers that additionally apply causal-temporal attention across observations (black arrows). We drop observation tokens for past timesteps in the upper layers of the ViT to compress the inputs and reduce the number of tokens passed to the VLA backbone.
Figure 3: We test MEM policies across multiple challenging, long-horizon dexterous manipulation tasks that require retaining memory for up to fifteen minutes, including setting up a recipe, cleaning up a kitchen (\ref{sec:challenge_tasks}), and making a grilled cheese sandwich (\ref{sec:analysis}).
Figure 4: Performance of policies on challenging, long-horizon manipulation tasks. Without memory, even state-of-the-art generalist policies like $\pi_{0.6}$ struggle to perform such tasks. We ablate the memory components and show that these tasks are solvable by combining short-horizon, observation-based memory with long-horizon, language-based memory. Naive memory of past language instructions, without compression, struggles with training-inference distribution shifts.
...and 4 more figures
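
The attention pattern in the Figure 2 caption can be sketched with boolean masks. This is a minimal illustration under stated assumptions, not the paper's encoder. It assumes tokens are laid out frame by frame, and that a causal-temporal layer lets each token attend to every token in its own frame and in earlier frames. The caption does not specify the exact cross-frame pattern, so this is one possible reading. The function names and the `tokens_per_frame` parameter are hypothetical.

```python
from typing import List


def frame_of(token: int, tokens_per_frame: int) -> int:
    """Index of the frame a flattened token position belongs to."""
    return token // tokens_per_frame


def spatial_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Bidirectional attention restricted to tokens within the same frame."""
    n = num_frames * tokens_per_frame
    return [
        [frame_of(q, tokens_per_frame) == frame_of(k, tokens_per_frame) for k in range(n)]
        for q in range(n)
    ]


def causal_temporal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Assumed pattern: a query may attend to its own frame and all earlier frames."""
    n = num_frames * tokens_per_frame
    return [
        [frame_of(k, tokens_per_frame) <= frame_of(q, tokens_per_frame) for k in range(n)]
        for q in range(n)
    ]


def latest_frame_tokens(num_frames: int, tokens_per_frame: int) -> List[int]:
    """Token positions kept after dropping all past-timestep tokens."""
    start = (num_frames - 1) * tokens_per_frame
    return list(range(start, num_frames * tokens_per_frame))


# Toy sizes: 3 frames with 2 tokens each, i.e. 6 tokens in total.
spatial = spatial_mask(3, 2)
temporal = causal_temporal_mask(3, 2)
kept = latest_frame_tokens(3, 2)
```

In this toy setting, dropping past-timestep tokens leaves 2 of the 6 positions. Those remaining tokens can still carry information from earlier frames, because the causal-temporal layers let them attend backwards before the drop.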
