SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Junjie Zhang, Chenjia Bai, Haoran He, Wenke Xia, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li

TL;DR

SAM-E tackles generalization and efficiency in language-conditioned 3D manipulation by combining a vision foundation model with sequence imitation. It introduces a LoRA-finetuned SAM encoder, a multi-view transformer for cross-view and language alignment, and a multi-channel action-sequence heatmap head for single-pass planning. Across RLBench tasks, SAM-E achieves higher success rates and substantially fewer inference steps than baselines, with strong few-shot adaptation and real-world feasibility. The results highlight the potential of visual foundation models for embodied agents and the benefits of action-sequence modeling for long-horizon manipulation.

Abstract

Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representations and multi-view 2D representations to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and they suffer from limited generalization to unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation that leverages a vision foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM), pre-trained on a huge number of images and promptable masks, as the foundation model for extracting task-relevant features, and we employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
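
The parameter-efficient fine-tuning mentioned above is LoRA (per the TL;DR). As a rough illustration of the general LoRA idea only, the sketch below keeps a weight matrix frozen and adds a trainable low-rank update scaled by `alpha / r`. The class name `LoRALinear`, the dimensions, and the hyperparameter values are hypothetical and are not taken from the paper; which SAM layers are adapted is not specified here.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Hypothetical linear layer: frozen W plus low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        # Frozen pre-trained weight (random here, standing in for an encoder weight).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors: A is small random, B starts at zero,
        # so the adapted layer initially matches the frozen layer.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_frozen(self):
        return len(self.W) * len(self.W[0])

    def num_trainable(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear(d_in=16, d_out=16, r=2)
x = [1.0] * 16
print(layer.num_frozen(), layer.num_trainable())
```

Only `A` and `B` would receive gradients during fine-tuning, which is why the trainable parameter count stays small relative to the frozen weight.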

Paper Structure (28 sections, 7 equations, 14 figures, 11 tables)

This paper contains 28 sections, 7 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Overview of SAM-E. (i) The SAM encoder provides promptable visual embedding of single-view observations after fine-tuning on embodied scenarios with parameter-efficient LoRA. (ii) Multi-view transformer achieves cross-view information integration and vision-language alignment. (iii) The coherent action sequence is predicted via temporal imitation for efficient multi-step execution.
  • Figure 2: The multi-view transformer has two stages: view-wise information integration, followed by cross-view information integration.
  • Figure 3: Movement shift in positions and rotations of the end effector in RLBench task close_jar, representing smooth changes of positions and rotations in temporally adjacent steps.
  • Figure 4: The Action-Sequence Policy Head outputs multi-channel pose heatmaps for a sequence of positions and rotations.
  • Figure 5: Comparison of training curves over 5 seeds with $\pm$1 std. We observe that SAM-E achieves a higher success rate than R3M and non-pre-trained baselines. Meanwhile, SAM and its variations achieve better training efficiency than RVT, benefiting from action sequence imitation. The training curve of RVT is from our reproduction using the official code.
  • ...and 9 more figures
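
The multi-channel heatmap idea in the Figure 4 caption can be illustrated with a minimal sketch, assuming one heatmap channel per future step and a simple per-channel peak readout. The function names, the single-view 2D setting, and the argmax decoding are assumptions made for illustration; the sketch does not cover how the paper handles rotations, multiple views, or lifting the predictions to 3D end-effector poses.

```python
import math


def softmax2d(logits):
    """Normalize an H x W grid of logits into a probability map."""
    peak = max(v for row in logits for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def decode_action_sequence(heatmaps):
    """Read one (row, col) position per channel from T heatmaps of shape H x W.

    All T channels come from a single forward pass, so the whole
    position sequence is decoded at once rather than step by step.
    """
    positions = []
    for channel in heatmaps:
        probs = softmax2d(channel)
        cells = ((p, (r, c)) for r, row in enumerate(probs) for c, p in enumerate(row))
        _, best_cell = max(cells, key=lambda item: item[0])
        positions.append(best_cell)
    return positions


def make_channel(h, w, hot):
    """Hypothetical helper: zero logits with a single high-scoring cell."""
    grid = [[0.0] * w for _ in range(h)]
    grid[hot[0]][hot[1]] = 5.0
    return grid


# Three channels stand in for three consecutive predicted steps.
heatmaps = [make_channel(4, 4, cell) for cell in [(0, 1), (2, 2), (3, 0)]]
print(decode_action_sequence(heatmaps))
```

Because each channel is decoded independently, the number of predicted steps is set by the channel count of the head rather than by repeated model calls.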