UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image

Xingyu Liu, Gu Wang, Ruida Zhang, Chenyangguang Zhang, Federico Tombari, Xiangyang Ji

TL;DR

UNOPose tackles unseen object 6DoF pose estimation from a single unposed RGB-D reference. It introduces an SE(3)-invariant global reference frame (GRF) and a local reference frame (LRF) to standardize representations, plus an overlap predictor to handle partial overlap, and employs a coarse-to-fine registration pipeline. The approach achieves state-of-the-art results in one-reference settings and remains competitive with CAD-model-based methods, validated on a new BOP-based benchmark with real-world datasets. This work reduces onboarding costs for novel objects and enables robust pose estimation in open-world scenarios, with potential extensions toward reconstructing unseen objects from a single reference.

Abstract

Unseen object pose estimation methods often rely on CAD models or multiple reference views, making the onboarding stage costly. To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image. While previous works leverage reference images as pose anchors to limit the range of relative pose, our scenario presents significant challenges since the relative transformation could vary across the entire SE(3) space. Moreover, factors like occlusion, sensor noise, and extreme geometry could result in low viewpoint overlap. To address these challenges, we present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation. Building upon a coarse-to-fine paradigm, UNOPose constructs an SE(3)-invariant reference frame to standardize object representation despite pose and size variations. To alleviate small overlap across viewpoints, we recalibrate the weight of each correspondence based on its predicted likelihood of being within the overlapping region. Evaluated on our proposed benchmark based on the BOP Challenge, UNOPose demonstrates superior performance, significantly outperforming traditional and learning-based methods in the one-reference setting and remaining competitive with CAD-model-based methods. The code and dataset are available at https://github.com/shanice-l/UNOPose.
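To make the overlap-based reweighting concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each correspondence already has a matching score and a predicted overlap probability, multiplies the two to obtain a weight, and feeds the weights into a standard weighted Kabsch (SVD) rigid fit. The function name `weighted_rigid_fit`, the product rule for combining scores, and the toy data are all illustrative assumptions that do not come from the paper.

```python
import numpy as np


def weighted_rigid_fit(src, dst, weights):
    """Return R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    Standard weighted Kabsch solution; hypothetical helper, not from the paper.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize weights to sum to 1
    src_mean = w @ src                     # weighted centroids
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean
    H = (src_c * w[:, None]).T @ dst_c     # 3x3 weighted covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])             # guard against reflections
    R = Vt.T @ D @ U.T
    t = dst_mean - R @ src_mean
    return R, t


# Toy data: 40 correct correspondences and 10 that fall outside the overlap.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = np.deg2rad(40.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.1, -0.2, 0.3])
dst = src @ R_true.T + t_true
dst[40:] = rng.normal(size=(10, 3))        # corrupted, non-overlapping matches

match_score = np.ones(50)                  # assumed matching confidence
overlap_prob = np.ones(50)
overlap_prob[40:] = 0.0                    # assumed overlap predictions

# Assumption: weight = matching score * predicted overlap probability.
R_est, t_est = weighted_rigid_fit(src, dst, match_score * overlap_prob)
```

In this toy setup the corrupted correspondences receive zero weight, so the fit depends only on the points inside the overlap. A real predictor would output soft probabilities, which down-weight doubtful matches rather than discarding them.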

Paper Structure

This paper contains 15 sections, 8 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of unseen object pose estimation. Given a query image presenting a target object unseen during training, we aim to estimate its segmentation and 6DoF pose w.r.t. a reference frame. While previous methods [labbe2023megapose, foundationposewen2024, liu2022gen6d, sun2022onepose] often rely on the CAD model or multiple RGB(-D) images for reference, we merely use one unposed RGB-D reference image.
  • Figure 2: The network architecture of UNOPose. Given the query and reference point clouds $\mathbf{Q}_{cam}$ and $\mathbf{P}_{cam}$ in the camera frame, UNOPose first transforms them into the $SE(3)$-invariant global reference frame (GRF). Feature descriptors are then extracted from sparse point sets to construct the coarse correlation matrix. To obtain precise correspondences, the fine pose estimation module exploits structural details using positional encoding and local reference frame (LRF) encoding.
  • Figure 3: Ablation of initial rotation distance between query and reference objects. We categorize all testing objects into nine groups according to the initial rotation distance, and evaluate the $\text{AR}_{\text{BOP}}$ metric and overlap ratio for each group separately.
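The grouping used in the Figure 3 ablation can be sketched as follows. The caption states that objects are split into nine groups by initial rotation distance but does not give the bin edges, so this example assumes the geodesic angle between the two rotations and nine equal 20° bins over [0°, 180°]. The helper names `rotation_distance_deg`, `rotation_group`, and `rot_z` are hypothetical.

```python
import math


def rotation_distance_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of elementwise products.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))


def rotation_group(distance_deg, num_groups=9, max_deg=180.0):
    """Index of the equal-width bin (assumed binning) containing the distance."""
    width = max_deg / num_groups
    return min(int(distance_deg // width), num_groups - 1)


def rot_z(deg):
    """Rotation about the z-axis, used only to build toy inputs."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
query_rotation = rot_z(90.0)
distance = rotation_distance_deg(identity, query_rotation)
group = rotation_group(distance)
```

Per-group metrics such as $\text{AR}_{\text{BOP}}$ could then be averaged over all test instances that share a group index.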