MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation

Runhao Li, Wenkai Guo, Zhenyu Wu, Changyuan Wang, Haoyuan Deng, Zhenyu Weng, Yap-Peng Tan, Ziwei Wang

TL;DR

MAP-VLA addresses the lack of memory in pre-trained Vision-Language-Action (VLA) models for long-horizon robotic manipulation by introducing demonstration-derived memory prompts. It builds a memory library via Memory Prompt Construction, then uses Memory-Augmented Action Generation to retrieve stage-specific prompts and blend them with base prompts through dynamic ensembling, all while keeping the VLA weights frozen. Results on LIBERO-Long and real-robot tasks show consistent gains over state-of-the-art baselines, with up to 7.0% absolute improvement in simulation and 25.0% in real-world evaluations, along with improved robustness to visual variations. The approach is lightweight and plug-and-play, which makes it practical for memory-aware robotic autonomy in both domestic and industrial settings.

Abstract

Pre-trained Vision-Language-Action (VLA) models have achieved remarkable success in improving robustness and generalization for end-to-end robotic manipulation. However, these models struggle with long-horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory-Augmented Prompting for Vision-Language-Action model (MAP-VLA), a novel framework that empowers pre-trained VLA models with demonstration-derived memory prompts to augment action generation for long-horizon robotic manipulation tasks. To achieve this, MAP-VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real-time task execution, MAP-VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug-and-play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP-VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long-horizon tasks, surpassing the current state-of-the-art methods.
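
The retrieve-then-integrate step described in the abstract can be illustrated with a small sketch. The abstract does not specify the similarity measure, the trajectory features, or the ensembling rule, so everything concrete here is an assumption made for illustration: cosine similarity over a short trajectory feature vector, a convex combination of base and memory prompts weighted by that similarity, and the hypothetical names `retrieve_memory`, `blend_prompts`, `key`, and `prompt`. The soft prompts are plain nested lists standing in for learned embeddings; no VLA model is called.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def retrieve_memory(query, library):
    """Return (index, similarity) of the stored stage closest to the query.

    Assumption: each memory unit stores a 'key' summarizing the trajectory
    of one task stage and a 'prompt' holding that stage's soft prompt.
    """
    sims = [cosine_similarity(query, unit["key"]) for unit in library]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

def blend_prompts(base, memory, similarity):
    """Convex combination of base and memory prompts (assumed rule).

    The weight is the similarity clipped to [0, 1], so a poor match
    falls back toward the base prompt.
    """
    w = min(max(similarity, 0.0), 1.0)
    return [
        [(1.0 - w) * b + w * m for b, m in zip(base_row, mem_row)]
        for base_row, mem_row in zip(base, memory)
    ]

# Toy memory library with two stages; prompts are 4 tokens x 8 dimensions.
library = [
    {"key": [1.0, 0.0, 0.0], "prompt": [[1.0] * 8 for _ in range(4)]},
    {"key": [0.0, 1.0, 0.0], "prompt": [[2.0] * 8 for _ in range(4)]},
]
base_prompt = [[0.0] * 8 for _ in range(4)]

# Features of the current rollout (hypothetical values).
query = [0.1, 0.9, 0.0]
idx, sim = retrieve_memory(query, library)
prompt = blend_prompts(base_prompt, library[idx]["prompt"], sim)
```

In this toy setup the query is closest to the second stage, so its prompt dominates the blend. In a real system the blended prompt would be passed to the frozen VLA model alongside the current observation; how MAP-VLA actually performs that integration is described in the paper, not here.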

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Simplified execution pipeline of existing VLA methods and MAP-VLA.
  • Figure 2: The framework of MAP-VLA. Our method augments a frozen pre-trained VLA model with demonstration-derived memory prompts for enhanced action generation during task execution. The Memory Prompt Construction stage encodes stage-specific knowledge from expert demonstrations into a library of memory prompts. The Memory-Augmented Action Generation stage retrieves the memory prompts and augments action generation with memory-aware prompt ensembling.
  • Figure 3: Performance comparison on all LIBERO task suites. "*" denotes results reported by OpenVLA (Kim et al., 2024).
  • Figure 4: Real-world environment setup.
  • Figure 5: Visualization and comparison of Task 2: place the green cube and the orange into the bowl.
  • ...and 1 more figure