SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

Shengkai Wu, Jinrong Yang, Wenqiu Luo, Linfeng Gao, Chaohui Shang, Meiyu Zhi, Mingshan Sun, Fangping Yang, Liangliang Ren, Yong Zhao

TL;DR

When a scene contains several graspable objects, imitation learning receives conflicting demonstrations, and the resulting multimodality degrades grasping policies. SAM2Grasp resolves this by conditioning the policy on a target prompt: a frozen SAM2 backbone feeds a lightweight ACT (Action Chunking with Transformers) head, and the system is paired with offline feature caching and asynchronous, temporally-ensembled inference. It achieves state-of-the-art performance in cluttered multi-object grasping and remains robust to occlusion in both simulation and real-world experiments, while staying training-efficient. The approach points toward language-guided or prompt-driven manipulation and extension to broader robotic tasks.
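
As a rough illustration of temporally-ensembled inference, the sketch below averages overlapping action chunks using the exponential weighting from the original ACT work. Whether SAM2Grasp uses exactly this weighting is an assumption, and the function name `temporal_ensemble`, the `chunks` dictionary, and the decay parameter `m` are hypothetical names chosen for illustration.

```python
import numpy as np


def temporal_ensemble(chunks, t, m=0.1):
    """Blend every prediction that overlapping action chunks make for step t.

    chunks: dict mapping the step at which a chunk was predicted to an
            array of shape (horizon, action_dim). Hypothetical layout.
    m:      decay rate; weights are exp(-m * i), with i = 0 for the
            oldest prediction, as in the original ACT formulation.
    """
    preds = []
    for start in sorted(chunks):  # oldest chunk first
        offset = t - start
        if 0 <= offset < len(chunks[start]):
            preds.append(chunks[start][offset])
    if not preds:
        raise ValueError(f"no chunk covers step {t}")

    preds = np.stack(preds)                   # (num_overlapping, action_dim)
    weights = np.exp(-m * np.arange(len(preds)))
    weights /= weights.sum()                  # normalise to a convex combination
    return (weights[:, None] * preds).sum(axis=0)


# Two chunks of horizon 4 that disagree on the action for step 1.
chunks = {0: np.zeros((4, 1)), 1: np.ones((4, 1))}
blended = temporal_ensemble(chunks, t=1)
```

Because the weights are normalised, the blended action always lies between the individual predictions, which smooths the transition whenever a newly predicted chunk arrives.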

Abstract

Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method uses a frozen SAM2 model for its strong temporal visual tracking and adds a lightweight, trainable action head that operates in parallel with SAM2's native segmentation head. With this design, only the small action head is trained, on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2's built-in temporal tracking keeps the selected object tracked, so the model continues to predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.
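
The averaging failure described in the abstract can be reproduced in a toy setting. The sketch below is not the paper's model: it assumes a one-dimensional grasp position, two hypothetical objects at `left_x` and `right_x`, and a one-hot vector standing in for the prompt. It only shows that a least-squares fit without the prompt lands between the objects, while the same fit conditioned on the prompt recovers each target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scene: two valid targets on a 1-D axis (metres).
left_x, right_x = -0.3, 0.3
n_demos = 200

# Demonstrations alternate between the two objects, with small noise.
target_id = np.arange(n_demos) % 2
demo_x = np.where(target_id == 0, left_x, right_x)
demo_x = demo_x + rng.normal(0.0, 0.01, size=n_demos)

# Unconditioned behavioural cloning with a squared-error loss: the best
# constant prediction is the mean of the demonstrations.
bc_action = demo_x.mean()

# Prompt-conditioned fit: a one-hot "prompt" tells the model which object
# each demonstration refers to, so the problem becomes uni-modal.
prompt = np.eye(2)[target_id]                      # (n_demos, 2)
weights, *_ = np.linalg.lstsq(prompt, demo_x, rcond=None)
conditioned_left, conditioned_right = weights
```

Here `bc_action` falls near zero, the empty space between the two objects, whereas `conditioned_left` and `conditioned_right` sit close to the true object positions.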

Paper Structure

This paper contains 22 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The Multi-modality Problem in Imitation Learning and Our Approach. (a) Given a single observation with multiple valid objects, expert demonstrations can be multi-modal (e.g., grasping the left or right object). (b) A standard Behavioral Cloning (BC) policy trained on this data fails by "mode averaging," predicting a physically nonsensical action that targets the empty space between objects. (c) Our method, SAM2Grasp, resolves this ambiguity by taking an additional prompt that specifies the target. This transforms the multi-modal problem into a uni-modal one, enabling the policy to generate a correct and unambiguous action.
  • Figure 2: The SAM2Grasp Architecture. Our framework uses a prompt to guide a frozen SAM2 model in extracting object-centric features, which are then fed into a trainable ACT policy head. At $t=0$, an external prompt $p$ is required. For $t>0$, SAM2's internal temporal memory handles object tracking autonomously. This design resolves object-level multimodality at the perception stage.
  • Figure 3: Simulation Experiments.
  • Figure 4: Real-World Experiments.
  • Figure 5: Robustness to Visual Occlusion in Simulation. Success rate of all methods as a function of the frame occlusion rate ($p$). SAM2Grasp exhibits remarkable resilience, showing only a graceful degradation in performance, whereas the baseline methods collapse under increasing visual perturbation. This directly demonstrates the critical role of SAM2's built-in temporal memory for handling visual occlusion.
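
The control flow in Figure 2 and the occlusion setting in Figure 5 can be sketched with a toy tracker. This is a hypothetical stand-in, not SAM2's API or memory mechanism: `ToyMemoryTracker`, `run_episode`, and the rule that an occluded frame simply reuses the last remembered box are all assumptions made for illustration. It shows only the interface implied by the captions: a prompt is required at $t=0$, none is needed afterwards, and frames are occluded with probability `p`.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


class ToyMemoryTracker:
    """Hypothetical prompt-initialised tracker with a one-slot memory."""

    def __init__(self) -> None:
        self.memory: Optional[Box] = None

    def step(self, observed: Optional[Box], prompt: Optional[Box] = None) -> Box:
        if self.memory is None:
            # First frame: nothing is tracked yet, so a prompt is mandatory.
            if prompt is None:
                raise ValueError("an external prompt is required at t=0")
            self.memory = prompt
        elif observed is not None:
            # Later visible frames refresh the memory without any prompt.
            self.memory = observed
        # Occluded frames (observed is None) fall back to the stored memory.
        return self.memory


def run_episode(track: List[Box], p: float, seed: int = 0) -> List[Box]:
    """Replay a ground-truth track, occluding frames t > 0 with probability p."""
    rng = random.Random(seed)
    tracker = ToyMemoryTracker()
    estimates = []
    for t, box in enumerate(track):
        occluded = t > 0 and rng.random() < p
        observed = None if occluded else box
        estimates.append(tracker.step(observed, prompt=box if t == 0 else None))
    return estimates


# A target drifting to the right over five frames.
track = [(float(i), 0.0, float(i) + 1.0, 1.0) for i in range(5)]
clean = run_episode(track, p=0.0)
fully_occluded = run_episode(track, p=1.0)
```

With `p=0.0` the estimates follow the target exactly; with `p=1.0` every estimate stays at the prompted box, since no later frame is visible to refresh the memory.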