MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang

TL;DR

MemoryVLA tackles the non-Markovian nature of robotic manipulation by introducing a hippocampus-inspired Perceptual–Cognitive Memory Bank (PCMB) that stores long-horizon perceptual details and semantic representations. It combines a Vision-Language Cognition Module that forms working memory, a memory Retrieval–Fusion–Consolidation mechanism, and a diffusion-based action head that generates temporally coherent 7-DoF actions. Across 150+ tasks on three robots, covering both simulated and real-world benchmarks, MemoryVLA surpasses the state-of-the-art baselines CogACT and π₀, with notable gains on long-horizon tasks and strong robustness to open-world distribution shifts. The work highlights the importance of explicit temporal memory in Vision-Language-Action models for scalable, generalizable robotic manipulation.

Abstract

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On the SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming the state-of-the-art baselines CogACT and π₀, with a notable +14.6-point gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves an 84.0% success rate, with long-horizon tasks showing a +26-point improvement over the state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
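
The retrieve-then-fuse step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections for retrieval, and a scalar sigmoid gate over the concatenated features for fusion. The names `retrieve`, `gate_fuse`, `gate_weights`, and `gate_bias` are hypothetical, and falling back to the query when the bank is empty is likewise an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, memory):
    """Return an attention-weighted average of the memory entries.

    Assumption: plain scaled dot-product attention, one head, no projections.
    """
    if not memory:
        # Assumption: with an empty bank, fall back to the current features.
        return list(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, entry)) / scale for entry in memory]
    weights = softmax(scores)
    return [
        sum(w * entry[i] for w, entry in zip(weights, memory))
        for i in range(len(query))
    ]


def gate_fuse(current, retrieved, gate_weights, gate_bias=0.0):
    """Blend current and retrieved features with a scalar sigmoid gate.

    Assumption: the gate is a linear map over the concatenated features;
    the gate value g weights the retrieved features and (1 - g) the current ones.
    """
    features = current + retrieved  # list concatenation
    logit = sum(w * f for w, f in zip(gate_weights, features)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))
    return [g * r + (1.0 - g) * c for c, r in zip(current, retrieved)]
```

In this sketch a query that aligns with one stored entry pulls the retrieved vector toward that entry, and a gate with zero weights and zero bias returns the midpoint of the current and retrieved features.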

Paper Structure

This paper contains 59 sections, 9 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: (a) In Push Buttons tasks, pre- and post-push states look nearly identical, calling for temporal modeling. (b) Humans handle manipulation tasks via a dual-memory system: working memory (neural activity) supports short-term control, while episodic memory (hippocampus) preserves long-term experience. (c) Inspired by this, MemoryVLA introduces a Perceptual–Cognitive Memory Bank that consolidates low-level perceptual details and high-level cognitive semantics for temporally aware decision making. (d) MemoryVLA outperforms state-of-the-art baselines.
  • Figure 2: Overall architecture of MemoryVLA. RGB observation and language instruction are encoded by a 7B VLM into perceptual and cognitive tokens, forming short-term working memory. The working memory queries a perceptual-cognitive memory bank (PCMB) to retrieve relevant historical context, including high-level semantics and low-level visual details, adaptively fuses it with current tokens, and consolidates the PCMB by merging the most similar neighbors. The memory-augmented tokens then condition a diffusion transformer to predict a sequence of future actions.
  • Figure 3: Details of memory module. (a) Retrieval: current perceptual and cognitive tokens query the PCMB via cross-attention with timestep positional encoding to fetch relevant historical features. (b) Gate fusion: current and retrieved tokens are adaptively fused via a gate mechanism. (c) Consolidation: the fused tokens are updated into PCMB. When PCMB reaches its capacity, we compute similarities between adjacent entries and merge the most similar pair to maintain compactness.
  • Figure 4: Experimental setup overview. Top: three simulation benchmarks, SimplerEnv-Bridge with WidowX, SimplerEnv-Fractal with Google Robot, and LIBERO with Franka. Bottom: real-world evaluation on two suites, General and Long-horizon Temporal. In total, we evaluate three robots across 10 suites, spanning over 150 tasks and 500 variations.
  • Figure 5: Robustness and generalization under out-of-distribution (OOD) conditions in the real world. (a,b) Examples of OOD variants for two representative tasks (Pick Place Order and Clean Restaurant Table), including unseen backgrounds, distractors, lighting, novel objects/containers, and occlusion. (c,d) Quantitative results showing that MemoryVLA maintains high success rates across these OOD variants, demonstrating strong robustness and generalization in real-world environments.
  • ...and 10 more figures
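
The consolidation rule in the Figure 3(c) caption (when the bank is full, merge the most similar pair of adjacent entries) can be sketched as follows. This is a hedged illustration rather than the paper's code. Cosine similarity and element-wise averaging are assumptions, since the caption does not specify the similarity measure or the merge operator, and `consolidate`, `bank`, and `capacity` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(bank, new_entry, capacity):
    """Append new_entry; if the bank exceeds capacity, merge one adjacent pair.

    Entries are kept in temporal order, so only neighbours are compared.
    Assumption: the merged entry is the element-wise mean of the pair.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    bank = bank + [list(new_entry)]  # copy, leaving the caller's list untouched
    if len(bank) <= capacity:
        return bank
    # Similarity between each entry and its successor.
    sims = [cosine(bank[i], bank[i + 1]) for i in range(len(bank) - 1)]
    i = max(range(len(sims)), key=sims.__getitem__)
    merged = [(x + y) / 2.0 for x, y in zip(bank[i], bank[i + 1])]
    return bank[:i] + [merged] + bank[i + 2:]
```

Restricting the comparison to adjacent entries keeps the cost linear in the bank size and preserves temporal ordering, since only neighbouring timesteps are ever collapsed.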