RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai

TL;DR

This work introduces RoboMME, a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. It also develops a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies.

Abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This makes it hard to understand and compare these models systematically or to measure progress. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

Paper Structure

This paper contains 113 sections, 29 equations, 26 figures, 26 tables.

Figures (26)

  • Figure 2: Framework of MME-VLA Suite. The top part illustrates three memory representations, each with two instantiations: (1) Symbolic Memory summarizes past interactions as high-level abstractions via language-based subgoals, optionally grounded to image pixels; (2) Perceptual Memory encodes history as raw visual tokens, using either token dropping to remove redundancy or uniform frame sampling to preserve essential context; (3) Recurrent Memory compresses history into fixed-size latent states through test-time training or recurrent memory transformers. The bottom part shows three integration strategies consistent with $\pi_{0.5}$: (1) Memory-as-Context concatenates memory tokens with observation tokens (e.g., images, task instructions, and proprioception); (2) Memory-as-Modulator applies adaptive LayerNorm to modulate the action expert, where action features cross-attend to memory tokens via multi-head attention; (3) Memory-as-Expert adds a separate memory expert that interacts with other experts through block-wise causal attention.
  • Figure 3: Performance comparison across task characteristics.
  • Figure 4: Efficiency-performance comparison among models.
  • Figure 5: Example Prompt for VLM Simple Subgoal Prediction. This is used to generate simple subgoals for the SimpleSG VLA policy.
  • Figure 6: Example Prompt for VLM Grounded Subgoal Prediction. This is used to generate grounded subgoals for the GroundedSG VLA policy.
  • ...and 21 more figures
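
The Figure 2 caption mentions uniform frame sampling (a Perceptual Memory instantiation) and Memory-as-Context, which concatenates memory tokens with observation tokens. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the function names `uniform_frame_indices` and `memory_as_context`, the `budget` parameter, the string stand-ins for tokens, and the choice to place memory tokens before observation tokens are all assumptions made for illustration.

```python
from typing import List, Sequence


def uniform_frame_indices(num_frames: int, budget: int) -> List[int]:
    """Pick up to `budget` evenly spaced indices from a history of `num_frames` frames.

    Hypothetical helper: the first and last frames are always kept when budget >= 2.
    """
    if num_frames <= 0 or budget <= 0:
        return []
    if budget >= num_frames:
        # The history already fits in the budget, so keep every frame.
        return list(range(num_frames))
    if budget == 1:
        # With a single slot, keep the most recent frame (an assumption).
        return [num_frames - 1]
    step = (num_frames - 1) / (budget - 1)
    return [round(i * step) for i in range(budget)]


def memory_as_context(memory_tokens: Sequence[str],
                      observation_tokens: Sequence[str]) -> List[str]:
    """Concatenate memory tokens with observation tokens into one context sequence.

    The ordering (memory first) is an assumption; the caption only says the
    two token groups are concatenated.
    """
    return list(memory_tokens) + list(observation_tokens)


# Usage: sample 3 of 9 history frames, then build the combined context.
history = [f"frame_{t}" for t in range(9)]
kept = [history[i] for i in uniform_frame_indices(len(history), budget=3)]
context = memory_as_context(kept, ["image", "instruction", "proprioception"])
print(context)
```

In a real model the tokens would be embedding vectors rather than strings, and the concatenated sequence would be fed to the policy backbone; the string version only shows how the history is subsampled and where the memory sits in the context.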