MEM: Multi-Scale Embodied Memory for Vision Language Action Models

Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z. Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, Karan Dhabalia, Michael Equi, Quan Vuong, Jost Tobias Springenberg, Sergey Levine, Chelsea Finn, Danny Driess

TL;DR

This work introduces Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies that combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory.

Abstract

Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
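
The two memory scales described above can be pictured as a small data structure: a bounded window of recent observations plus a running text summary. The following is a minimal sketch only, not the authors' implementation. The class `MultiScaleMemory`, the helper `append_event`, and the window size are hypothetical names and values chosen for illustration. In MEM the text memory is updated by a trained high-level policy and the frames are compressed by a video encoder; here both are replaced by trivial stand-ins.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Deque, List, Tuple


@dataclass
class MultiScaleMemory:
    """Hypothetical container: short-horizon frames plus long-horizon text."""

    max_frames: int = 8  # assumed window size, not taken from the paper
    frames: Deque[Any] = field(default_factory=deque)
    language_memory: str = ""

    def __post_init__(self) -> None:
        # Bound the frame buffer so the oldest observations are dropped first.
        self.frames = deque(self.frames, maxlen=self.max_frames)

    def observe(self, frame: Any) -> None:
        """Add the newest observation to the short-horizon memory."""
        self.frames.append(frame)

    def update_language_memory(
        self, update_fn: Callable[[str, str], str], event: str
    ) -> None:
        """Rewrite the text memory; update_fn stands in for a learned model."""
        self.language_memory = update_fn(self.language_memory, event)

    def policy_inputs(self) -> Tuple[List[Any], str]:
        """Return what a low-level policy would condition on in this sketch."""
        return list(self.frames), self.language_memory


def append_event(memory: str, event: str) -> str:
    """Stand-in update rule: append the event to the existing summary."""
    return event if not memory else f"{memory}; {event}"


if __name__ == "__main__":
    mem = MultiScaleMemory(max_frames=3)
    for i in range(5):
        mem.observe(f"frame_{i}")
    mem.update_language_memory(append_event, "picked up sponge")
    mem.update_language_memory(append_event, "wiped counter")
    print(mem.policy_inputs())
```

The point of the split is that the frame window stays at a fixed size no matter how long the task runs, while the text summary carries the semantic history (for example, which recipe stages are done) in far fewer tokens than the raw observations would need.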

Paper Structure

This paper contains 35 sections, 5 equations, and 9 figures.

Figures (9)

  • Figure 0: The MEM memory system equips VLAs like $\pi_{0.6}$ with long-horizon memory via two key components: (1) a high-level policy is trained to keep track of long-horizon semantic events by updating a language memory $m_t$ (left, \ref{sec:language_memory}), and (2) the low-level policy uses a short-horizon, observation-based memory that is efficiently encoded via a video encoder (right, \ref{sec:video_encoder}).
  • Figure 1: Naively passing a sequence of observations into the backbone of a VLA rapidly increases inference latency. Our efficient video encoder architecture allows us to use many observation frames while remaining under critical real-time inference thresholds (black2025real; black2025training). Timings measured for the $\pi_{0.6}$ VLA, with four input camera streams on one NVIDIA H100 GPU.
  • Figure 2: We propose to use an efficient video encoder architecture for compressing short-horizon, image-based memory. Our architecture expands standard ViTs for encoding video inputs by interleaving layers that apply bidirectional spatial attention within each observation (white arrows) with layers that additionally apply causal-temporal attention operations across observations (black arrows). We drop observation tokens for past timesteps in upper layers of the ViT to compress the inputs and reduce the number of tokens passed to the VLA backbone.
  • Figure 3: We test MEM policies across multiple challenging, long-horizon dexterous manipulation tasks that require retaining memory for up to fifteen minutes, including setting up a recipe, cleaning up a kitchen (\ref{sec:challenge_tasks}), and making a grilled cheese sandwich (\ref{sec:analysis}).
  • Figure 4: Performance of policies on challenging, long-horizon manipulation tasks. Without memory, even state-of-the-art generalist policies like $\pi_{0.6}$ struggle to perform such tasks. We ablate the memory components and show that these tasks are solvable by combining short-horizon, observation-based memory with long-horizon, language-based memory. Naive memory of past language instructions, without compression, struggles with training-inference distribution shifts.
  • ...and 4 more figures
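
The attention pattern described in the Figure 2 caption can be illustrated with boolean masks over a flattened token sequence. This is a rough sketch under stated assumptions, not the paper's architecture: tokens are assumed to be ordered frame by frame with `P` patch tokens per frame, the causal-temporal layer is assumed to let a token attend to every token in its own frame and in earlier frames (the caption does not say whether temporal attention is restricted to matching patch positions), and token dropping is assumed to keep only the most recent frame. All function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

Token = TypeVar("Token")


def spatial_mask(num_frames: int, patches: int) -> List[List[bool]]:
    """Bidirectional attention within a frame: query and key share a frame."""
    n = num_frames * patches
    return [[(q // patches) == (k // patches) for k in range(n)] for q in range(n)]


def causal_temporal_mask(num_frames: int, patches: int) -> List[List[bool]]:
    """Assumed variant: attend to the own frame and to all earlier frames."""
    n = num_frames * patches
    return [[(k // patches) <= (q // patches) for k in range(n)] for q in range(n)]


def keep_current_frame(
    tokens: Sequence[Token], num_frames: int, patches: int
) -> List[Token]:
    """Assumed token dropping: discard tokens of all past timesteps."""
    return list(tokens[(num_frames - 1) * patches:])


if __name__ == "__main__":
    T, P = 3, 2  # toy sizes: 3 frames, 2 patch tokens per frame
    for row in causal_temporal_mask(T, P):
        print("".join("x" if allowed else "." for allowed in row))
    print(len(keep_current_frame(list(range(T * P)), T, P)), "tokens kept")
```

Under these assumptions the number of tokens handed to the VLA backbone stays at `P` per camera regardless of how many past frames the encoder saw, which is the compression effect the caption attributes to dropping past-timestep tokens in the upper layers.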