SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

Jiehong Lin, Lihua Liu, Dekun Lu, Kui Jia

TL;DR

SAM-6D, a novel framework designed to realize the task of zero-shot 6D object pose estimation through two steps, including instance segmentation and pose estimation of novel objects, outperforms the existing methods on the seven core datasets of the BOP Benchmark.

Abstract

Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.
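
The last step the abstract describes, turning dense 3D-3D correspondences into a pose estimate, can be illustrated with the standard weighted Kabsch (SVD) solution for a rigid transform. This is a minimal sketch under stated assumptions, not the PEM implementation: the abstract does not say which solver SAM-6D uses, and the function name `pose_from_correspondences`, the optional `weights` vector, and the toy data are all hypothetical.

```python
import numpy as np


def pose_from_correspondences(src, dst, weights=None):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of matched 3D points (e.g. model and scene points).
    weights:  optional (N,) non-negative confidences (hypothetical input).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if weights is None:
        weights = np.ones(len(src))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids of both point sets.
    src_c = (w[:, None] * src).sum(axis=0)
    dst_c = (w[:, None] * dst).sum(axis=0)

    # Weighted 3x3 cross-covariance of the centered points.
    H = (w[:, None] * (src - src_c)).T @ (dst - dst_c)

    # Kabsch: rotation from the SVD, with a sign fix to exclude reflections.
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Toy check (assumed data): recover a known pose from noise-free matches.
rng = np.random.default_rng(0)
model_pts = rng.normal(size=(50, 3))
angle = np.pi / 6
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.1, -0.2, 0.5])
scene_pts = model_pts @ R_true.T + t_true

R_est, t_est = pose_from_correspondences(model_pts, scene_pts)
```

With noise-free correspondences the recovered `R_est` and `t_est` match the ground-truth pose up to floating-point error. With real matches, the weights would typically down-weight unreliable correspondences.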

Paper Structure (40 sections, 17 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: We present SAM-6D for zero-shot 6D object pose estimation. SAM-6D takes an RGB image (a) and a depth map (b) of a cluttered scene as inputs, and performs instance segmentation (d) and pose estimation (e) for novel objects (c). We present the qualitative results of SAM-6D on the seven core datasets of the BOP benchmark, including YCB-V, LM-O, HB, T-LESS, IC-BIN, ITODD and TUD-L, arranged from left to right. Best viewed in the electronic version.
  • Figure 2: An overview of our proposed SAM-6D, which consists of an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM) for joint instance segmentation and pose estimation of novel objects in RGB-D images. ISM leverages the Segment Anything Model (SAM) to generate all possible proposals and selectively retains valid ones based on object matching scores. PEM involves two stages of point matching, from coarse to fine, to establish 3D-3D correspondence and calculate object poses for all valid proposals. Best viewed in the electronic version.
  • Figure 3: An illustration of Pose Estimation Model (PEM) of SAM-6D.
  • Figure 4: Qualitative results of our Instance Segmentation Model with or without the appearance matching score $s_{appe}$.
  • Figure 5: Qualitative results of our Instance Segmentation Model with or without the geometric matching score $s_{geo}$.
  • ...and 5 more figures
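
The proposal filtering that the Figure 2 caption attributes to ISM (keep only proposals whose object matching score is high enough) could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the `Proposal` type, the name `s_sem` (chosen by analogy with $s_{appe}$ and $s_{geo}$), the plain average of the three scores, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    """One mask proposal with three per-aspect scores in [0, 1] (hypothetical)."""
    mask_id: int
    s_sem: float   # semantic matching score (name assumed)
    s_appe: float  # appearance matching score
    s_geo: float   # geometric matching score


def matching_score(p: Proposal) -> float:
    # Assumption: an unweighted mean; the paper's actual combination may differ.
    return (p.s_sem + p.s_appe + p.s_geo) / 3.0


def keep_valid(proposals: List[Proposal], threshold: float = 0.5) -> List[Proposal]:
    # Retain proposals whose combined score reaches the (assumed) threshold.
    return [p for p in proposals if matching_score(p) >= threshold]


proposals = [
    Proposal(mask_id=0, s_sem=0.9, s_appe=0.8, s_geo=0.7),
    Proposal(mask_id=1, s_sem=0.2, s_appe=0.3, s_geo=0.1),
    Proposal(mask_id=2, s_sem=0.6, s_appe=0.5, s_geo=0.7),
]
valid = keep_valid(proposals)
```

In this toy input, proposal 1 scores low on all three aspects and is discarded, while proposals 0 and 2 are kept.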