EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu

TL;DR

EmbodiedSAM tackles online, real-time 3D instance segmentation in streaming RGB-D scenes by lifting 2D SAM masks into 3D queries and refining them with a dual-level decoder to produce frame-consistent masks $M_t^{cur}$. It introduces geometric-aware pooling to create 3D superpoint features, and a fast, matrix-based merging strategy using geometric, contrastive, and semantic auxiliary tasks to link current and past masks via a similarity matrix $\mathcal{C}$. The method achieves state-of-the-art performance on multiple datasets, runs in real-time (about $80$ ms per frame on a single RTX 3090, with faster variants using FastSAM), and demonstrates strong generalization and data efficiency, including open-vocabulary potential. These results suggest a practical pathway to deploy powerful 2D foundation models for embodied perception in real-world robotics and navigation tasks.

Abstract

Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFMs) have revolutionized the field of 2D computer vision with superior performance, which makes the use of VFMs to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow to be applied in practical embodied tasks. In this paper, we aim to leverage the Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames, so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM as 3D-aware queries, which are then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefiting from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views with efficient matrix operations, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show that our method achieves leading performance even compared with offline methods. Our method also demonstrates great generalization ability in several zero-shot dataset transfer experiments and shows great potential in open-vocabulary and data-efficient settings. Code and demo are available at https://xuxw98.github.io/ESAM/, with only one RTX 3090 GPU required for training and evaluation.
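
As a rough illustration of why a query representation makes cross-view matching cheap, the sketch below computes a similarity matrix between per-mask embedding vectors with a single matrix product. It is a minimal sketch under stated assumptions, not the released ESAM code: the function name `similarity_matrix`, the use of cosine similarity, and the toy 2-D embeddings are all hypothetical choices made for illustration.

```python
import numpy as np

def similarity_matrix(prev_feats, cur_feats, eps=1e-8):
    """Cosine similarity between every previous and every current mask embedding.

    prev_feats: (num_prev, dim) embeddings of masks merged so far (assumed layout).
    cur_feats:  (num_cur, dim) embeddings of masks from the current frame.
    Returns an array of shape (num_prev, num_cur).
    """
    prev = np.asarray(prev_feats, dtype=np.float64)
    cur = np.asarray(cur_feats, dtype=np.float64)
    # Normalize each row to unit length so the dot product is a cosine.
    prev_n = prev / (np.linalg.norm(prev, axis=1, keepdims=True) + eps)
    cur_n = cur / (np.linalg.norm(cur, axis=1, keepdims=True) + eps)
    # One matrix multiplication yields all pairwise similarities at once.
    return prev_n @ cur_n.T

# Toy embeddings (hypothetical values, chosen only to make the result easy to read).
prev_feats = [[1.0, 0.0], [0.0, 2.0]]
cur_feats = [[0.0, 3.0], [1.0, 1.0], [4.0, 0.0]]
C = similarity_matrix(prev_feats, cur_feats)
print(C.round(3))
```

Because every pair is scored in one product, the cost grows with the number of queries rather than with the number of points in each mask, which is the property the abstract credits for real-time inference.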

Paper Structure (18 sections, 9 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Unlike previous 3D SAM methods (yang2023sam3d, xu2023sampro3d, yin2023sai3d), which project 2D masks to 3D and merge them with hand-crafted strategies, ESAM lifts 2D masks to 3D queries and iteratively refines them to predict accurate 3D masks. With 3D queries, ESAM can also quickly merge 3D masks from different frames using simple matrix operations. Compared with SAM3D (yang2023sam3d), ESAM surpasses its performance by 23.2% AP while running more than $20\times$ faster.
  • Figure 2: Overview of ESAM. At a new time instant $t$, we first adopt SAM to generate 2D instance masks $M_t^{2d}$. We propose a geometric-aware query lifting module to lift $M_t^{2d}$ to 3D queries $Q_t$ while preserving fine-grained shape information. $Q_t$ are refined by a dual-level decoder, which enables efficient cross-attention and generates fine-grained point-wise masks $M_t^{cur}$ from $Q_t$. Then $M_t^{cur}$ is merged into previous masks $M_{t-1}^{pre}$ by a fast query merging strategy.
  • Figure 3: Details of our efficient query merging strategy. We propose three representative auxiliary tasks, which generate geometric, contrastive and semantic representations in the form of vectors. The similarity matrix can then be computed efficiently by matrix multiplication. We further prune the similarity matrix and adopt bipartite matching to merge the instances.
  • Figure 4: Visualization results of different 3D instance segmentation methods on the ScanNet200 dataset. As highlighted in the red boxes, SAM3D predicts noisy masks while SAI3D tends to over-segment an instance into multiple parts.
  • Figure 5: Visualization of the auxiliary tasks for our merging strategy. (a) 3D box prediction for geometric similarity. We visualize the bounding boxes of an object at different time instants. (b) t-SNE visualization of the instance-specific representation for contrastive similarity. Different colors indicate different instances and different points indicate the instance feature at different frames. (c) Query-wise semantic segmentation for semantic similarity.
  • ...and 3 more figures
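
The pruning and bipartite matching steps described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `merge_assignments`, the threshold value of 0.5, the rows-are-previous / columns-are-current layout, and the toy similarity values are assumptions. The matching itself uses SciPy's `linear_sum_assignment`, which solves the assignment problem on the pruned matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_assignments(C, threshold=0.5):
    """Match current masks to previous masks from a similarity matrix.

    C: (num_prev, num_cur) similarity matrix (assumed layout).
    threshold: hypothetical pruning cutoff; entries below it are zeroed.
    Returns (matches, new_instances), where matches is a list of
    (prev_index, cur_index) pairs and new_instances lists the current
    masks that were not matched to any previous mask.
    """
    C = np.asarray(C, dtype=np.float64)
    # Prune: discard weak similarities so they can never form a match.
    pruned = np.where(C >= threshold, C, 0.0)
    # Bipartite matching that maximizes the total remaining similarity.
    rows, cols = linear_sum_assignment(pruned, maximize=True)
    # Keep only assignments that survived pruning.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if pruned[r, c] > 0.0]
    matched_cur = {c for _, c in matches}
    new_instances = [c for c in range(C.shape[1]) if c not in matched_cur]
    return matches, new_instances

# Toy similarity matrix: 2 previous masks, 3 current masks (hypothetical values).
C = [[0.9, 0.2, 0.1],
     [0.3, 0.8, 0.4]]
matches, new_instances = merge_assignments(C)
print(matches, new_instances)
```

In this toy case the third current mask has no similarity above the cutoff, so it would be registered as a new instance instead of being merged into an existing one.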