
Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, Jianfei Yang

Abstract

Robotic manipulation often requires memory: occlusion and state changes can render decision-time observations perceptually aliased, so action selection becomes non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discard the fine-grained perceptual cues needed for disambiguation and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and performs goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across these tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
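
To make the recall mechanism concrete, here is a minimal sketch of a differentiable, goal-conditioned memory read under our own assumptions: the names (`EpisodicMemory`, `write`, `recall`) and the single-head attention read are illustrative, not Chameleon's actual interface.

```python
# Illustrative sketch only: a differentiable episodic memory with
# goal-directed recall. Class and method names are our assumptions,
# not the paper's API.
import torch
import torch.nn as nn


class EpisodicMemory(nn.Module):
    def __init__(self, token_dim: int, query_dim: int):
        super().__init__()
        self.key_proj = nn.Linear(token_dim, query_dim)    # tokens -> keys
        self.value_proj = nn.Linear(token_dim, token_dim)  # tokens -> values
        self.slots: list[torch.Tensor] = []                # written episodes

    def write(self, tokens: torch.Tensor) -> None:
        """Append geometry-grounded multimodal tokens of shape (T, token_dim)."""
        self.slots.append(tokens)

    def recall(self, goal_query: torch.Tensor) -> torch.Tensor:
        """Soft, fully differentiable read: the goal query attends over
        every stored token and returns a weighted-sum readout."""
        mem = torch.cat(self.slots, dim=0)                   # (N, token_dim)
        keys = self.key_proj(mem)                            # (N, query_dim)
        scores = keys @ goal_query / keys.shape[-1] ** 0.5   # (N,)
        weights = torch.softmax(scores, dim=0)
        return weights @ self.value_proj(mem)                # (token_dim,)


mem = EpisodicMemory(token_dim=256, query_dim=128)
mem.write(torch.randn(10, 256))      # e.g. tokens from one interaction step
readout = mem.recall(torch.randn(128))
print(readout.shape)                 # torch.Size([256])
```

Because the read is a softmax-weighted sum rather than a hard top-k lookup, gradients can flow from a downstream policy loss back into the write-time encoders, which is what makes a memory stack of this kind trainable end to end.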

Figures (8)

  • Figure 1: Memory-intensive robotic manipulation is often perceptually aliased: in the shell game, identical cups look the same at decision time while the correct grasp depends on earlier interactions. Existing approaches often discard fine-grained details needed for reliable decision-making. Inspired by the human Entorhinal Cortex–Hippocampus–Prefrontal Cortex (EC–HC–PFC) episodic memory system, we propose Chameleon for episodic recall, spatial tracking, and long-horizon task completion.
  • Figure 2: Method overview. Chameleon follows a Perception $\rightarrow$ Memory $\rightarrow$ Policy pipeline for long-horizon manipulation under perceptual aliasing. Perception produces geometry-grounded, view-consistent tokens that disambiguate observations before memory formation. Memory couples an episodic state with a working state and learns relevance-based recall through HoloHead rollout training. Conditioned on the memory readout, Policy samples a future end-effector pose trajectory via conditional flow matching for execution (a minimal sampling sketch follows this list).
  • Figure 3: Illustration of task execution in Camo-Dataset. Clean a specified plate: the agent remembers the user’s newly placed plate. At the decision stage, three plates are similar. Play shell game: the agent tracks the cup covering the cube. At the decision stage, the three cups appear identical. Add various seasonings: the agent recalls which spoon was picked up. At the decision stage, the three spoons are similar.
  • Figure 4: Pattern separation in the decision state. UMAP projections of $h_t$ for all three task categories, colored by the latent decision target. Clear clustering under visual aliasing indicates that $h_t$ disambiguates the history-dependent state.
  • Figure 5: Dorsal stream enables write-time disambiguation. Cross-view attention in the spatial task: with the dorsal stream, attention focuses on the correct target via end-effector geometry and epipolar feasibility; without it, attention diffuses across distractors, degrading downstream memory and execution (an epipolar-gating sketch also follows this list).
  • ...and 3 more figures
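
The Figure 2 caption states that the policy samples a future end-effector pose trajectory via conditional flow matching. The sketch below shows the generic sampling recipe for that model family: Euler-integrate a learned velocity field from Gaussian noise at $t=0$ to a trajectory at $t=1$, conditioned on the memory readout. The network architecture, horizon length, and 7-DoF pose parameterization are our assumptions, not Chameleon's model.

```python
# Generic conditional flow-matching sampler (sketch, not Chameleon's code).
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Predicts dx/dt for a flattened trajectory x, conditioned on time t
    and a memory/goal embedding."""

    def __init__(self, traj_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, x, t, cond):
        # x: (B, traj_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([x, t, cond], dim=-1))


@torch.no_grad()
def sample_trajectory(v, cond, traj_dim, steps=20):
    """Euler-integrate dx/dt = v(x, t | cond) from noise (t=0) to data (t=1)."""
    x = torch.randn(cond.shape[0], traj_dim)   # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i * dt)
        x = x + dt * v(x, t, cond)
    return x


# Hypothetical shapes: a 16-step horizon of 7-DoF poses (xyz + quaternion).
v = VelocityField(traj_dim=16 * 7, cond_dim=256)
traj = sample_trajectory(v, cond=torch.randn(2, 256), traj_dim=16 * 7)
print(traj.shape)  # torch.Size([2, 112]) -> reshape to (2, 16, 7)
```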
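
Figure 5 attributes write-time disambiguation partly to epipolar feasibility in cross-view attention. The paper does not specify how this is implemented; below is a hedged sketch of one standard way to realize such gating, masking attention from a query pixel in view A to key pixels in view B that lie far from the epipolar line $l = F x$. The fundamental matrix `F`, the threshold, and the function name are hypothetical.

```python
# Sketch of epipolar-feasibility gating for cross-view attention
# (one common construction; not necessarily Chameleon's).
import torch


def epipolar_mask(F, query_xy, key_xy, thresh=4.0):
    """F: (3, 3) fundamental matrix mapping view-A points to view-B epipolar
    lines; query_xy: (Q, 2) pixel coords in A; key_xy: (K, 2) pixel coords
    in B. Returns a (Q, K) bool mask, True where attention is feasible."""
    q = torch.cat([query_xy, torch.ones(len(query_xy), 1)], dim=-1)  # (Q, 3)
    k = torch.cat([key_xy, torch.ones(len(key_xy), 1)], dim=-1)      # (K, 3)
    lines = q @ F.T                                  # (Q, 3): l_i = F @ q_i
    # Point-to-line distance |ax + by + c| / sqrt(a^2 + b^2), per (q, k) pair.
    dist = (lines @ k.T).abs() / lines[:, :2].norm(dim=-1, keepdim=True)
    return dist < thresh


# Typical use: bias cross-view attention logits before the softmax, e.g.
# logits = logits.masked_fill(~epipolar_mask(F, q_xy, k_xy), float("-inf"))
```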