Novel Object 6D Pose Estimation with a Single Reference View

Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Hossein Rahmani, Ajmal Mian, Lin Wang

TL;DR

This work tackles novel object 6D pose estimation using only a single reference view, addressing the scalability limitations of CAD-model and dense-reference approaches. It introduces SinRef-6D, which performs iterative object-space point-wise alignment guided by RGB and Points State Space Models (SSMs) that capture long-range spatial information with linear complexity, and then solves for the pose via weighted SVD. Key contributions include the integration of RGB and Points SSMs, an iterative focalization-and-alignment pipeline with two non-shared GeoTransformers, and empirical results across six public datasets and real-world scenes showing CAD-free performance competitive with CAD-based methods. Because it needs neither textured CAD models nor dense reference views, the method is practical for mobile and robotic deployments and scales to unseen objects. Reported limitations are top-down views and reflective materials, which point to future work on robustness in those scenarios.
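
The weighted SVD step mentioned above can be illustrated with the standard weighted Kabsch solution for a rigid transform between matched 3D points. This is a minimal sketch, not the authors' implementation: the function name `weighted_svd_pose`, its argument layout, and the use of NumPy are assumptions made for illustration, and the correspondence weights are simply taken as given.

```python
import numpy as np

def weighted_svd_pose(src, dst, weights):
    """Return (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of matched points (hypothetical inputs).
    weights:  (N,) non-negative correspondence confidences.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1

    # Weighted centroids of both point sets.
    src_c = (w[:, None] * src).sum(axis=0)
    dst_c = (w[:, None] * dst).sum(axis=0)

    # Weighted 3x3 cross-covariance of the centered points.
    H = (w[:, None] * (src - src_c)).T @ (dst - dst_c)

    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

With noise-free correspondences and uniform weights, the function recovers the transform that generated `dst` from `src`; down-weighting unreliable matches reduces their influence on the estimated pose.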

Abstract

Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in a common coordinate system based on state space models (SSMs). Specifically, iterative object-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.
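
To make the linear-complexity claim for SSMs concrete, the sketch below runs a generic discrete linear state space recurrence, h_t = A h_{t-1} + B x_t and y_t = C h_t, over a sequence. It is not the paper's RGB or Points SSM, whose architectures are not described in this summary; the function name `ssm_scan` and the fixed matrices A, B, C are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a generic linear SSM over a sequence (illustrative only).

    x: (L, d_in) input sequence
    A: (n, n) state transition, B: (n, d_in) input map, C: (d_out, n) readout
    Returns y with shape (L, d_out).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:          # one constant-cost update per element: O(L) overall
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)
```

Each step touches only the current input and a fixed-size state, so the cost grows linearly with sequence length, in contrast to the quadratic pairwise cost of full self-attention.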

Paper Structure

This paper contains 17 sections, 8 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Comparison of manual reference view-based novel object 6D pose estimation methods. (a) Dense reference views-based methods typically rely on ①: 3D object reconstruction or ②: template matching, which is time- and storage-consuming. (b) The proposed SinRef-6D estimates novel object pose using only a single reference view, providing enhanced efficiency and scalability.
  • Figure 2: Our proposed SinRef-6D framework. Given a normal RGB-D reference view of a novel object, we aim to predict its 6D pose from any query view. SinRef-6D comprises four modules: (A) RGB-D images from the reference and query views are segmented, and the segmented depth maps are back-projected into point clouds. (B) The corresponding point clouds from the reference and query views are focalized from the camera coordinate system to the object coordinate system. (C) Leveraging the proposed Points and RGB SSMs (details are shown in Fig. 3 and Fig. 4), features are extracted from the focalized point clouds and RGB images, forming point-wise reference and query features. (D) These features are then used to establish point-wise alignment to solve the object pose. Finally, the computed pose is fed back into module (B) to iteratively improve the accuracy of the point-wise alignment, yielding a more precise object pose.
  • Figure 3: Detailed architecture of the proposed Points SSM.
  • Figure 4: Detailed architecture of the proposed RGB SSM.
  • Figure 5: Qualitative comparison on the LineMod dataset, showing the outputs of Gen6D, our SinRef-6D, and the ground truth from top to bottom.
  • ...and 7 more figures
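
Module (A) in the Figure 2 caption back-projects a segmented depth map into a point cloud. Under the usual pinhole camera model this can be sketched as follows; the function name `backproject_depth` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def backproject_depth(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D camera-frame points.

    depth: (H, W) depth map in metric units; zeros are treated as invalid.
    mask:  (H, W) object segmentation mask (truthy = object pixel).
    fx, fy, cx, cy: pinhole intrinsics (hypothetical parameter names).
    Returns an (N, 3) array of [x, y, z] points.
    """
    depth = np.asarray(depth, dtype=float)
    valid = np.asarray(mask, dtype=bool) & (depth > 0)
    v, u = np.nonzero(valid)       # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx          # inverse of u = fx * x / z + cx
    y = (v - cy) * z / fy          # inverse of v = fy * y / z + cy
    return np.stack([x, y, z], axis=1)
```

The resulting points live in the camera coordinate system; the focalization in module (B) would then move them into the object coordinate system, a step this sketch does not attempt to reproduce.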