Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective

Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa Vo, Kashu Yamazaki, Chase Rainwater, Tung Kieu, Anh Nguyen, Ngan Le

TL;DR

The paper tackles the challenge of memory for object-centric robotic manipulation in non-Markovian environments. It introduces LIBERO-Mem, a benchmark suite that stresses object-level memory with long-horizon, temporally entangled tasks and object identity ambiguities, and presents Embodied-SlotSSM, a slot-based memory framework that couples persistent object slots with a relational action decoder. The approach combines transient localization via Slot Attention and a SlotSSM-based dynamics model to maintain object identity over time and support memory-grounded decision making, showing improvements over reactive baselines in both general and non-Markovian tasks. This work advances scalable, memory-aware visuomotor systems for robotics, with implications for more reliable long-horizon manipulation in cluttered and ambiguous real-world settings.

Abstract

As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed), visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
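
The token-scaling argument in the abstract can be made concrete with a small back-of-the-envelope sketch. It compares the context length needed when every past frame's tokens are kept against the context length when history is folded into one recurrent state per slot. The figures used here (256 tokens per frame, 8 slots) and the function names are illustrative assumptions, not values or APIs from the paper.

```python
# Illustrative only: token and slot counts are assumed, not taken from the paper.

def frame_history_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Context length when every past frame's tokens stay in the prompt."""
    return num_frames * tokens_per_frame


def slot_state_tokens(num_slots: int) -> int:
    """Context length when history is folded into one recurrent state per slot."""
    return num_slots  # does not depend on how many frames have been seen


if __name__ == "__main__":
    for frames in (1, 100, 300):
        print(frames, frame_history_tokens(frames, 256), slot_state_tokens(8))
```

Under these assumed numbers, a 300-frame episode needs 76,800 frame tokens, while the slot-state representation stays at 8 entries regardless of episode length.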

Paper Structure

This paper contains 36 sections, 1 theorem, 11 equations, 10 figures, 5 tables.

Key Result

Proposition 1

Let $\mathcal{O}_t = \{ o_t^{(1)}, \ldots, o_t^{(k)} \}$ be a set of $k$ objects at time $t$, each represented by a latent $z_t^{(j)} = f_{\text{enc}}(v_t^{(j)})$ derived from the visual input $v_t^{(j)}$. Suppose that for all $i \ne j$, $v_t^{(i)} \approx v_t^{(j)}$ such that $z_t^{(i)} \approx z_t

Figures (10)

  • Figure 1: LIBERO-Mem: robotic manipulation tasks of object-level POMDP dependencies. These tasks require memory of prior actions and object-specific state tracking beyond what purely Markovian or fully observable policies can handle, highlighting the importance of persistent, object-specific memory for short- and long-horizon reasoning across visually similar inputs. In (a) object motion (OM), the robot must recall its last action (e.g., pick up or place down) to act correctly. In (b) object sequence (OS), success depends on remembering how many times an object has been manipulated, since visual cues are insufficient. In (c) multi-object sequence (OR), the robot must track the temporal order of object relations and interactions (e.g., from left to right). In (d) multi-object occlusion (OO), occluded objects require the robot to rely on memory of past placements to identify targets.
  • Figure 1: The histogram of frame count across different subsets of LIBERO and our proposed LIBERO-Mem.
  • Figure 2: Embodied-SlotSSM: Our framework combining slot-based dynamics (Slot Attention, Slot Fusion, Slot-based SSM) with an LLM Action Decoder for object memory-aware action prediction based on textual prompts.
  • Figure 2: The detailed histograms of frame count across different subsets of LIBERO and our proposed LIBERO-Mem.
  • Figure 3: Slot visualization in task T1: gripper and bowl slots (bbox, attention) as robot lifts and places the bowl down.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Proposition 1: Object history enables individuation under visual ambiguity
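
The intuition behind the proposition's title can be illustrated with a toy example. The formal statement is truncated in this copy, so the sketch below covers only the informal idea: two objects whose current-frame latents coincide cannot be told apart by appearance, but they can be once each carries its own interaction history. The class, function names, and latent values are hypothetical and chosen purely for illustration.

```python
# Toy illustration with hypothetical names; not the paper's formal construction.
from dataclasses import dataclass, field


@dataclass
class TrackedObject:
    latent: tuple                                # appearance feature for the current frame
    history: list = field(default_factory=list)  # past interactions with this object


def distinguishable_by_appearance(a: TrackedObject, b: TrackedObject, tol: float = 1e-6) -> bool:
    """True if the current-frame latents alone separate the two objects."""
    return any(abs(x - y) > tol for x, y in zip(a.latent, b.latent))


def distinguishable_by_history(a: TrackedObject, b: TrackedObject) -> bool:
    """True if the per-object interaction histories differ."""
    return a.history != b.history


bowl_1 = TrackedObject(latent=(0.2, 0.7))
bowl_2 = TrackedObject(latent=(0.2, 0.7))   # visually identical to bowl_1
bowl_1.history.append("picked_and_placed")  # only bowl_1 has been manipulated
```

Here `distinguishable_by_appearance(bowl_1, bowl_2)` is false while `distinguishable_by_history(bowl_1, bowl_2)` is true, which is the situation a memoryless policy cannot resolve.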