SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation

Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, Jiafei Duan

TL;DR

SAM2Act introduces a multi-view, language-conditioned transformer-based policy for high-precision 3D robotic manipulation, leveraging visual foundation-model embeddings to improve generalization. Building on this, SAM2Act+ adds a memory-augmented architecture with a memory bank, encoder, and attention to enable spatial memory and episodic recall, evaluated via MemoryBench. The approach achieves state-of-the-art results on the RLBench and Colosseum benchmarks and demonstrates strong memory performance and real-world transfer, highlighting the value of integrating memory with foundation-model visual representations. While promising, the work notes limitations in dexterous control and semantic-memory storage, outlining avenues for future improvements.

Abstract

Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalizing to complex environmental variations and in addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves an average success rate of 94.3% on memory-based tasks in MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems. Project page: sam2act.github.io.

Paper Structure

This paper contains 32 sections, 3 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: SAM2Act is a multi-view, language-conditioned behavior cloning policy trained with fewer demonstrations. Given a language instruction, it can execute high-precision tasks, such as turning the tiny knob on the lamp. It also generalizes to various environmental variations, such as changes in lighting conditions. Through further training with our proposed memory architecture, it evolves into SAM2Act+, which can solve tasks that require implicit spatial memory, such as remembering where the robot previously stored the pliers, as depicted in the above figure.
  • Figure 2: Simulation and Real Tasks. We demonstrate the effectiveness of SAM2Act+ in solving memory-based tasks by evaluating it against baselines on the three benchmark memory tasks (shown at the top). Additionally, we validate our approach using a Franka Panda robot on four real-world tasks (shown at the bottom), including tests under out-of-distribution perturbations.
  • Figure 3: Overview of the SAM2Act (top) and SAM2Act+ (bottom) architectures. The SAM2Act architecture leverages the SAM2 image encoder to generate prompt-conditioned, multi-resolution embeddings, fine-tuned with LoRA for efficient adaptation to manipulation tasks. A multi-view transformer aligns spatial coordinates with language instructions, while a cascaded multi-resolution upsampling mechanism refines feature maps and generates accurate translation heatmaps. SAM2Act+ extends this architecture by incorporating memory-based components, including the Memory Encoder, Memory Attention, and Memory Bank, into the coarse branch. These components enable memory-driven reasoning by processing historical heatmaps and integrating prior observations, allowing the agent to predict actions based on stored contextual information. Observations are reconstructed into point clouds, rendered into three virtual images, and lifted into 3D translation points, enabling precise spatial reasoning across both architectures.
  • Figure 4: SAM2Act Module and multi-resolution upsampling mechanism. A cascade of three convex upsamplers processes feature maps at increasing resolutions, integrating multi-resolution embeddings from the SAM2 image encoder through elementwise addition and layer normalization. The upsamplers progressively refine features, doubling spatial dimensions at each stage, to generate accurate translation heatmaps while capturing fine-grained spatial details critical for manipulation tasks.
  • Figure 5: Real-world Robot Setup. A Franka Panda robot with a Robotiq Gripper. A RealSense D455 depth sensor captures the scene.
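
The cascaded upsampling described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`convex_upsample_2x`, `cascade_upsample`), the random linear projection that stands in for a learned mask predictor, the 3x3 neighbourhood, and the order of operations (upsample, then add the encoder embedding, then normalize) are all assumptions made for illustration. Only the overall shape of the computation (three stages, each doubling spatial size, fused by elementwise addition and layer normalization) comes from the caption.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    """Normalize over the channel (last) axis; no learned scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def convex_upsample_2x(feat, logits):
    """Upsample (H, W, C) features to (2H, 2W, C).

    Each fine pixel is a convex combination (softmax weights) of the 3x3
    coarse neighbourhood around its parent pixel.
    logits has shape (H, W, 2, 2, 9): one weight vector per fine sub-pixel.
    """
    H, W, C = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Gather the 9 neighbours of every coarse pixel: (H, W, 9, C).
    neighbours = np.stack(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=2,
    )
    weights = softmax(logits, axis=-1)
    fine = np.einsum("hwijk,hwkc->hwijc", weights, neighbours)
    # Interleave the 2x2 sub-pixels into the spatial grid.
    return fine.transpose(0, 2, 1, 3, 4).reshape(2 * H, 2 * W, C)


def cascade_upsample(coarse, encoder_feats, rng):
    """Run one 2x stage per encoder embedding, fusing by add + layer norm."""
    x = coarse
    for skip in encoder_feats:
        channels = x.shape[-1]
        # Hypothetical stand-in for a learned mask head: random projection.
        w_mask = rng.normal(scale=0.1, size=(channels, 2 * 2 * 9))
        logits = (x @ w_mask).reshape(x.shape[0], x.shape[1], 2, 2, 9)
        x = convex_upsample_2x(x, logits)
        x = layer_norm(x + skip)  # elementwise addition, then layer norm
    return x


rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 4, 8))
# Assumed encoder embeddings at 2x, 4x and 8x the coarse resolution.
skips = [rng.normal(size=(4 * 2**k, 4 * 2**k, 8)) for k in (1, 2, 3)]
refined = cascade_upsample(coarse, skips, rng)
print(refined.shape)
```

Because the weights are a softmax, a constant feature map stays constant after upsampling, which is a quick sanity check on the convex-combination step.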
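
The memory components named in the Figure 3 caption (Memory Encoder, Memory Attention, Memory Bank) can be sketched in the same hedged spirit. The code below is an illustrative toy, not the SAM2Act+ architecture. The fixed-capacity FIFO bank, the heatmap-weighted fusion in `encode_memory`, and the projection-free cross-attention in `memory_attention` are assumptions chosen to keep the example short. A real model would use learned projections and positional information.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MemoryBank:
    """Fixed-capacity FIFO store of encoded past observations (assumed design)."""

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)  # oldest entry is evicted first

    def add(self, tokens):
        self.entries.append(tokens)

    def tokens(self):
        """Return all stored tokens as one (M, D) array, or None if empty."""
        if not self.entries:
            return None
        return np.concatenate(list(self.entries), axis=0)


def encode_memory(features, heatmap):
    """Hypothetical memory encoder: emphasize tokens where the heatmap is high.

    features: (N, D) tokens of one step; heatmap: (N,) non-negative scores.
    """
    weights = heatmap / (heatmap.sum() + 1e-8)
    return features * (1.0 + weights[:, None])


def memory_attention(current, memory):
    """Cross-attend current tokens (queries) to memory tokens (keys and values)."""
    if memory is None:
        return current  # nothing stored yet: pass features through unchanged
    dim = current.shape[-1]
    scores = current @ memory.T / np.sqrt(dim)
    attn = softmax(scores, axis=-1)
    return current + attn @ memory  # residual connection


# Usage: condition each step on the encoded features of earlier steps.
rng = np.random.default_rng(0)
bank = MemoryBank(capacity=2)
for step in range(3):
    feats = rng.normal(size=(16, 32))        # stand-in for coarse-branch tokens
    conditioned = memory_attention(feats, bank.tokens())
    heatmap = softmax(conditioned.sum(axis=-1))  # stand-in for a predicted heatmap
    bank.add(encode_memory(conditioned, heatmap))
print(bank.tokens().shape)
```

With a capacity of 2, only the two most recent steps remain in the bank after the third iteration, so the policy in this toy setup can recall at most two past observations.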