MemoAct: Atkinson-Shiffrin-Inspired Memory-Augmented Visuomotor Policy for Robotic Manipulation

Liufan Tan, Jiale Li, Gangshan Jing

Abstract

Memory-augmented robotic policies are essential in handling memory-dependent tasks. However, existing approaches typically rely on simple observation window extensions, struggling to simultaneously achieve precise task state tracking and robust long-horizon retention. To overcome these challenges, inspired by the Atkinson-Shiffrin memory model, we propose MemoAct, a hierarchical memory-based policy that leverages distinct memory tiers to tackle specific bottlenecks. Specifically, lossless short-term memory ensures precise task state tracking, while compressed long-term memory enables robust long-horizon retention. To enrich the evaluation landscape, we construct MemoryRTBench based on RoboTwin 2.0, specifically tailored to assess policy capabilities in task state tracking and long-horizon retention. Extensive experiments across simulated and real-world scenarios demonstrate that MemoAct achieves superior performance compared to both existing Markovian baselines and history-aware policies. The project page is available at https://tlf-tlf.github.io/MemoActPage/.


Paper Structure

This paper contains 15 sections, 3 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: (a) An example of a memory-dependent task. (b) Policies lacking historical awareness fail under identical observations, while existing representative memory mechanisms suffer from limited long-horizon retention and poor task state tracking. (c) Inspired by the Atkinson-Shiffrin memory model, we propose MemoAct, which simultaneously enables precise task state tracking and robust long-horizon retention. (d) Results on MemoryRTBench, RMBench and real-world experiments demonstrate that MemoAct significantly outperforms baseline algorithms.
  • Figure 2: Overview of MemoAct architecture. The sensory distillation module first encodes RGB images and proprioceptive states into high-fidelity features, termed sensory memory. This memory serves as a query to retrieve relevant historical context from the long short-term memory bank, which is processed by a temporal transformer encoder. Subsequently, a gating network adaptively fuses the retrieved history with the current sensory memory to produce a condition token. Guided by this token, the action decoder iteratively denoises random noise into history-aware action trajectories. Finally, the consolidation module updates the memory bank after each forward pass.
  • Figure 3: Illustration of the Long Short-Term Memory Consolidation Module. Newly generated observation embeddings are first appended to the STMB. Upon saturation of the STMB capacity, the earliest $N_{sc}$ entries are compressed by feeding them, alongside a learnable summary token, into the temporal transformer encoder. The resulting summary token is migrated to the LTMB, while the original $N_{sc}$ entries are discarded. The most similar adjacent pair in the LTMB is merged to ensure storage efficiency.
  • Figure 4: Overview of simulation and real-world tasks. The figure illustrates four simulation tasks from MemoryRTBench alongside two additional real-world tasks, designed to evaluate the model's capabilities in task state tracking and long-horizon retention. The tasks are executed sequentially following the alphabetical order (i.e., A $\to$ B $\to$ C $\to$ ...). Notably, identical observations encountered during the execution are highlighted in red or blue.
  • Figure 5: Performance comparison of MemoAct under different memory capacities (%). Long-term memory primarily dictates long-horizon retention capabilities, whereas short-term memory is critical for maintaining accurate task state tracking.
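
The consolidation procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. The class `MemoryBank`, its parameter names, and the helpers `mean_pool` and `cosine` are hypothetical. Element-wise mean pooling stands in for the temporal transformer encoder with a learnable summary token. The sketch also assumes that compression triggers once the STMB exceeds its capacity, and that the most similar adjacent LTMB pair is merged (here by averaging) only when the LTMB exceeds its own capacity; the caption does not specify either detail.

```python
import math
from typing import Callable, List

Vector = List[float]


def mean_pool(entries: List[Vector]) -> Vector:
    """Stand-in compressor (assumption): element-wise mean of the entries.

    The paper feeds the entries plus a learnable summary token through a
    temporal transformer encoder; that model is not reproduced here.
    """
    n = len(entries)
    return [sum(column) / n for column in zip(*entries)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


class MemoryBank:
    """Hypothetical two-tier bank: lossless STMB plus compressed LTMB."""

    def __init__(
        self,
        stm_capacity: int,
        ltm_capacity: int,
        n_sc: int,
        compress: Callable[[List[Vector]], Vector] = mean_pool,
    ) -> None:
        if not 0 < n_sc <= stm_capacity:
            raise ValueError("n_sc must be in 1..stm_capacity")
        if ltm_capacity < 1:
            raise ValueError("ltm_capacity must be at least 1")
        self.stm_capacity = stm_capacity
        self.ltm_capacity = ltm_capacity
        self.n_sc = n_sc
        self.compress = compress
        self.stmb: List[Vector] = []  # short-term memory bank (raw embeddings)
        self.ltmb: List[Vector] = []  # long-term memory bank (summaries)

    def consolidate(self, embedding: Vector) -> None:
        """Append a new observation embedding, then compress and merge if needed."""
        self.stmb.append(list(embedding))
        if len(self.stmb) <= self.stm_capacity:
            return
        # Assumed saturation rule: compress the earliest n_sc entries into one
        # summary, move it to the LTMB, and discard the originals.
        oldest, self.stmb = self.stmb[: self.n_sc], self.stmb[self.n_sc:]
        self.ltmb.append(self.compress(oldest))
        # Assumed merge rule: only merge while the LTMB is over capacity.
        while len(self.ltmb) > self.ltm_capacity:
            self._merge_most_similar_adjacent()

    def _merge_most_similar_adjacent(self) -> None:
        """Replace the most similar adjacent LTMB pair with its mean."""
        sims = [
            cosine(self.ltmb[i], self.ltmb[i + 1])
            for i in range(len(self.ltmb) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        self.ltmb[i : i + 2] = [mean_pool([self.ltmb[i], self.ltmb[i + 1]])]
```

Calling `consolidate` once per forward pass mirrors the update step named in the Figure 2 caption. Merging only adjacent entries keeps the LTMB in temporal order, which a temporal encoder reading from the bank would presumably rely on.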