VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo

TL;DR

Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT.

Abstract

Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
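The compressor described in the abstract can be illustrated with a small NumPy sketch. It is a rough, assumption-laden illustration rather than the authors' implementation: it uses single-head attention with no learned projections, residual connections, and random vectors in place of the learnable summary tokens. The names `compress`, `summary_cache`, and `obs_cache` are hypothetical. It shows only the property the abstract emphasizes: the number of episodic memory tokens stays fixed however many out-of-window observations have accumulated.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention (no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def compress(summary_tokens, summary_cache, obs_cache):
    """One compressor pass (illustrative only).

    summary_tokens: (M, D) query tokens (learnable in the paper, random here)
    summary_cache:  (S, D) past summary tokens; S may be 0
    obs_cache:      (T, D) out-of-window observation tokens
    Returns (M, D) episodic memory tokens.
    """
    # Self-attention over the current summaries plus the cached past summaries.
    ctx = np.concatenate([summary_cache, summary_tokens], axis=0)
    h = summary_tokens + attention(summary_tokens, ctx, ctx)
    # Cross-attention from the summaries to the cached historical observations.
    h = h + attention(h, obs_cache, obs_cache)
    return h

def run_episode(num_steps, num_summary=4, dim=8, seed=0):
    # Recursively compress a growing observation cache; return output shapes.
    rng = np.random.default_rng(seed)
    summary_tokens = rng.normal(size=(num_summary, dim))
    summary_cache = np.zeros((0, dim))
    obs_cache = np.zeros((0, dim))
    shapes = []
    for _ in range(num_steps):
        new_obs = rng.normal(size=(1, dim))  # stand-in for an encoded frame
        obs_cache = np.concatenate([obs_cache, new_obs], axis=0)
        memory = compress(summary_tokens, summary_cache, obs_cache)
        summary_cache = memory  # keep only the latest summaries in this sketch
        shapes.append(memory.shape)
    return shapes
```

Calling `run_episode(5)` returns the same `(4, 8)` shape at every step, even though the observation cache grows from one token to five. In this sketch the observation cache is unbounded. The paper's near-constant per-step cost presumably depends on subsampling or bounding that cache, which this example does not model.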

Paper Structure (12 sections, 3 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Comparison between VPWEM and existing methods: (a) Policies typically predict actions based on observations within a fixed-size context window. Historical observations that move out of the window are typically discarded and will not be used during action generation. (b) VPWEM augments diffusion-based policies with working and episodic memories, where the contextual memory compressor recursively consolidates historical observation tokens into fixed-size memory tokens.
  • Figure 2: Overview of the VPWEM framework. (a) Policy architecture. Modules in green are learnable components, including the multi-modal encoder, contextual memory compressor, summary tokens, and the Transformer-based noise predictor in the diffusion policy. These components are optimized end-to-end with a behavior cloning loss. (b) Training process. Each training sample contains the complete trajectory from the beginning of the episode. The short-term memory component follows the standard diffusion policy baselines. Observations outside the context window are subsampled with a fixed ratio and passed to the contextual memory compressor, which outputs a fixed number of memory tokens. The combined long- and short-term memory tokens are used to condition the noise prediction network. (c) Inference process. Each decision step consists of three stages: encoding new frames and updating the observation and summary caches; compressing out-of-window tokens to obtain long-term memory tokens; and predicting the action chunk.
  • Figure 3: Three benchmarks in our experiments.
  • Figure 4: Performance on MoMaRT benchmark.
  • Figure 5: Performance on MIKASA benchmark.
  • ...and 2 more figures
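
The Figure 2 caption describes splitting the history into an in-window part (working memory) and an out-of-window part that is subsampled with a fixed ratio before compression. The bookkeeping might look like the sketch below. The function name `split_history` and the parameters `window` and `stride` are hypothetical, and the exact subsampling scheme used in the paper is not stated in this excerpt.

```python
def split_history(t, window, stride):
    """Partition time steps 0..t at the current step t (0-indexed).

    Returns (out_of_window, working):
      working       - the most recent `window` steps (short-term working memory)
      out_of_window - earlier steps, subsampled every `stride` steps, which
                      would be handed to the memory compressor
    """
    start = max(0, t - window + 1)           # first step still inside the window
    working = list(range(start, t + 1))
    out_of_window = list(range(0, start, stride))
    return out_of_window, working
```

For example, with `window=4` and `stride=2` at step `t=9`, the working memory covers steps 6 to 9 and the compressor would see steps 0, 2, and 4. Early in an episode, while `t < window`, nothing has left the window and the out-of-window list is empty.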