ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries

Hang Li, Fengyi Shen, Dong Chen, Liudi Yang, Xudong Wang, Jinkui Shi, Zhenshan Bing, Ziyuan Liu, Alois Knoll

Abstract

Vision-language-action (VLA) models for closed-loop robot control are typically cast under the Markov assumption, making them prone to errors on tasks requiring historical context. To incorporate memory, existing VLAs either retrieve from a memory bank, which can be misled by distractors, or extend the frame window, whose fixed horizon still limits long-term retention. In this paper, we introduce ReMem-VLA, a Recurrent Memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries for propagating information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries for carrying context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, implicitly guiding the model's decisions without additional training or inference cost. Furthermore, to enhance visual memory, we introduce Past Observation Prediction as an auxiliary training objective. Through extensive memory-centric simulation and real-world robot experiments, we demonstrate that ReMem-VLA exhibits strong memory capabilities across multiple dimensions, including spatial, sequential, episodic, temporal, and visual memory. ReMem-VLA significantly outperforms memory-free VLA baselines $\pi_{0.5}$ and OpenVLA-OFT and surpasses MemoryVLA on memory-dependent tasks by a large margin.

Paper Structure (21 sections, 9 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of ReMem-VLA. Compared to vanilla VLA models, ReMem-VLA is equipped with frame-level recurrent memory for short-term retention (e.g., maintaining a fixed pose over several seconds) and chunk-level recurrent memory for long-term memory (e.g., tracking overall task progress). Additionally, ReMem-VLA incorporates past observation prediction to recall historical visual information to strengthen visual memory. ReMem-VLA demonstrates superior performance across spatial, temporal, episodic, sequential, and visual memory benchmarks, substantially outperforming all baselines.
  • Figure 2: Model Architecture of ReMem-VLA. ReMem-VLA continuously processes visual observations and language instructions at each timestep, predicting an action chunk and a past frame. Simultaneously, two sets of learnable recurrent memory queries extract information from the VLM backbone and propagate it to future timesteps via exponential moving average (EMA) updates. Frame-level memory queries are updated at every frame for short-term memory, while chunk-level memory queries are updated only at chunk boundaries (every $K$ frames) for long-term retention. Action queries and hindsight queries extract features from the VLM and interact with these recurrent memory queries through bidirectional attention in the connector, fusing current observations with accumulated historical context to condition the predictions.
  • Figure 3: MemoryBench SAM2ACT with our extended long-horizon task. 1) Put Block Back: place the block at the center, press a button, and then return the block to its original location. 2) Rearrange Block: move the block from the center to the unoccupied red patch, press a button, and then relocate the block that was originally on the red patch to the center. 3) Reopen Drawer: close the open drawer, press a button, and then reopen the same drawer. 4) Long Horizon Task: we extend MemoryBench with an additional long-horizon task to evaluate the VLA's memory over a long horizon. This task combines Put Block Back and Rearrange Block.
  • Figure 4: Real world experiments and quantitative results on memory-dependent tasks.
  • Figure 5: Failure analysis and ablation on Past Observation Prediction (POP).
  • ...and 1 more figures
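
The Figure 2 caption describes two memory states refreshed by EMA on different schedules: one at every frame, one only every $K$ frames. The following is a minimal NumPy sketch of that update schedule, not the authors' implementation. The function names, the decay coefficients `alpha_frame` and `alpha_chunk`, the zero initialization, and the use of a single feature stream as a stand-in for what both query sets extract from the VLM backbone are all assumptions made for illustration.

```python
import numpy as np


def ema_update(memory, new_features, alpha):
    """Standard EMA: keep a fraction `alpha` of the old memory, blend in the rest."""
    return alpha * memory + (1.0 - alpha) * new_features


def run_dual_level_memory(features, chunk_size, alpha_frame=0.5, alpha_chunk=0.9):
    """Roll two memory states over a sequence of per-frame feature vectors.

    features    : list of 1-D arrays, hypothetical stand-ins for query outputs
    chunk_size  : K, the number of frames per chunk
    alpha_*     : hypothetical decay coefficients (not taken from the paper)
    """
    frame_mem = np.zeros_like(features[0])  # short-term state (assumed zero init)
    chunk_mem = np.zeros_like(features[0])  # long-term state (assumed zero init)
    chunk_updates = 0

    for t, feat in enumerate(features, start=1):
        # Frame-level memory is refreshed at every frame.
        frame_mem = ema_update(frame_mem, feat, alpha_frame)
        # Chunk-level memory is refreshed only at chunk boundaries (every K frames).
        if t % chunk_size == 0:
            chunk_mem = ema_update(chunk_mem, feat, alpha_chunk)
            chunk_updates += 1

    return frame_mem, chunk_mem, chunk_updates


if __name__ == "__main__":
    frames = [np.ones(4) for _ in range(6)]
    frame_mem, chunk_mem, n = run_dual_level_memory(frames, chunk_size=3)
    print(frame_mem[0], chunk_mem[0], n)
```

Because the chunk-level state changes only once per $K$ frames, it drifts more slowly than the frame-level state under the same input, which is the intuition behind pairing a fast short-term memory with a slow long-term one.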