MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images

Junwen Huang, Hao Yu, Kuan-Ting Yu, Nassir Navab, Slobodan Ilic, Benjamin Busam

TL;DR

MatchU tackles unseen-object 6D pose estimation by learning generic, rotation-invariant 3D descriptors for CAD and depth, and fusing 2D texture through a Latent Fusion Attention module. A Bridged Coarse-level Matching Loss leverages RGB features to align cross-modal latent spaces, while a fine-level Sinkhorn-based matching refines correspondences to estimate pose with RANSAC. It demonstrates state-of-the-art accuracy and speed on five BOP core datasets for unseen objects, without retraining on test objects, and shows robust symmetry handling through texture-enabled descriptors. The approach outperforms several RGB-D fusion baselines and offers a scalable path toward practical unseen-object 6D pose estimation in robotics and AR, albeit relying on external segmentation modules that could be integrated end-to-end in the future.
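The Sinkhorn step mentioned above can be illustrated with a minimal sketch. It assumes a plain similarity matrix between two small descriptor sets and alternately normalises rows and columns in log space. The function names, the temperature `tau`, and the toy data are hypothetical and not taken from the paper, which may use a different formulation (for example, extra slack rows and columns for unmatched points).

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis, keeping dims."""
    m = np.max(a, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))

def sinkhorn_log(scores, n_iters=50):
    """Alternately normalise rows and columns of exp(scores) in log space.

    For a square score matrix the result approaches a doubly stochastic
    soft-assignment matrix.
    """
    log_p = np.asarray(scores, dtype=float).copy()
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=1)  # rows sum to 1
        log_p = log_p - logsumexp(log_p, axis=0)  # columns sum to 1
    return np.exp(log_p)

def unit_rows(x):
    """Scale every row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy data (assumption): desc_q holds noisy, permuted copies of desc_p.
rng = np.random.default_rng(0)
desc_p = unit_rows(rng.normal(size=(5, 16)))
perm = np.array([2, 0, 4, 1, 3])
desc_q = unit_rows(desc_p[perm] + 0.01 * rng.normal(size=(5, 16)))

tau = 0.05                                # hypothetical temperature
scores = (desc_q @ desc_p.T) / tau        # similarity of every (q, p) pair
assignment = sinkhorn_log(scores)         # soft assignment matrix
matches = assignment.argmax(axis=1)       # hard match for each q point
```

Working in log space avoids overflow when the scaled scores are large. In this toy setting `matches` recovers the permutation used to build `desc_q`.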

Abstract

Recent learning methods for object pose estimation require resource-intensive training for each individual object instance or category, hampering their scalability in real applications when confronted with previously unseen objects. In this paper, we propose MatchU, a Fuse-Describe-Match strategy for 6D pose estimation from RGB-D images. MatchU is a generic approach that fuses 2D texture and 3D geometric cues for 6D pose prediction of unseen objects. We rely on learning geometric 3D descriptors that are rotation-invariant by design. By encoding pose-agnostic geometry, the learned descriptors naturally generalize to unseen objects and capture symmetries. To tackle ambiguous associations using 3D geometry only, we fuse additional RGB information into our descriptor. This is achieved through a novel attention-based mechanism that fuses cross-modal information, together with a matching loss that leverages the latent space learned from RGB data to guide the descriptor learning process. Extensive experiments demonstrate the generalizability of the RGB-D fusion strategy and the efficacy of the descriptors. Benefiting from these novel designs, MatchU surpasses all existing methods by a significant margin in terms of both accuracy and speed, even without the requirement of expensive re-training or rendering.
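To make the idea of rotation invariance "by design" concrete, the sketch below uses a classic hand-crafted point-pair feature (one distance and three angles) and checks that it is unchanged when the points and normals undergo a random rigid motion. This is an illustrative stand-in, not the learned descriptor of the paper; the helper names `point_pair_feature` and `random_rotation` are hypothetical.

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4D point-pair feature: one distance and three angles."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    d_hat = d / dist

    def angle(a, b):
        # Angle between two unit vectors, clipped for numerical safety.
        return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

    return np.array([dist, angle(n1, d_hat), angle(n2, d_hat), angle(n1, n2)])

def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix column signs
    if np.linalg.det(q) < 0:      # flip one axis so that det = +1
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
p1, p2 = rng.normal(size=3), rng.normal(size=3)      # two surface points
n1, n2 = rng.normal(size=3), rng.normal(size=3)      # their normals
n1, n2 = n1 / np.linalg.norm(n1), n2 / np.linalg.norm(n2)

R = random_rotation(rng)
t = rng.normal(size=3)

before = point_pair_feature(p1, n1, p2, n2)
# Points are rotated and translated; normals are only rotated.
after = point_pair_feature(R @ p1 + t, R @ n1, R @ p2 + t, R @ n2)
invariant = np.allclose(before, after, atol=1e-8)
```

Because the feature depends only on relative distances and angles, any input built from such quantities is pose-agnostic regardless of how the object is oriented in the scene.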

Paper Structure (20 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: MatchU provides a pipeline to match a previously unseen 3D CAD model of an object to an RGB-D image (top left). (Fuse) Information from the RGB-D image and the CAD model is fused. (Describe) The fused information is turned into generic, color-aware, rotation-invariant 3D descriptors. (Match) The descriptors are used to establish correspondences and estimate the 6D pose.
  • Figure 2: Overview of MatchU. Upon encountering an unseen object, we initially derive the segmented depth point cloud $Q$ and the corresponding RGB image crop $K$ utilizing a pre-trained generic segmentation network. Subsequently, we procure both 3D and 2D local features from the CAD point cloud $P$, depth point cloud $Q$, and the RGB image crop $K$. These extracted features are then amalgamated within a latent space through our innovative Latent Fusion Attention Module, under the guidance of a Bridged Coarse-level Matching Loss (BCM Loss) $\mathcal{L}_{c}^{PKQ}$. The refined 3D descriptors $\widetilde{\phi}^{P'}$ and $\widetilde{\phi}^{Q'}$ are fed into decoders, which enhance the resolution of the descriptors to ${\phi}^{P}$ and ${\phi}^{Q}$, this process being steered by a detailed matching loss $\mathcal{L}_{f}^{PQ}$. In the final stage, the 6D pose of the novel objects is deduced by aligning the descriptors within the latent space and aggregating the pose parameters $\mathcal{T}$.
  • Figure 3: t-SNE visualization of our descriptors for symmetric objects. We showcase the capability of capturing both continuous and discrete symmetries without external annotation.
  • Figure 4: t-SNE visualization of a regular tetrahedron with different colored faces (left). Our method can extract distinct descriptors for the red face (right) while RoITr cannot (center).
  • Figure 5: Qualitative 6D pose estimation results of our method compared with MegaPose and ZeroPose. The upper row shows an egg box that is heavily occluded in the LM-O dataset. Our method is robust to occlusion, while the other methods mistakenly flip the pose. The lower row shows a highly ambiguous object whose pose the other methods estimate upside down, while ours predicts it accurately.
  • ...and 1 more figure
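
The final stage in the Figure 2 caption, recovering a rigid pose from matched CAD and scene points, can be sketched with a generic RANSAC loop around a least-squares (Kabsch) alignment. This is a textbook construction under stated assumptions, not the paper's exact estimator: the function names, the iteration count, the inlier threshold (in metres), and the synthetic correspondences are all hypothetical.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def ransac_pose(src, dst, n_iters=200, inlier_thresh=0.01, seed=0):
    """Fit poses to random 3-point samples and keep the largest consensus set."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("RANSAC found no consensus set")
    # Refit on all inliers of the best hypothesis.
    R, t = kabsch(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers

# Synthetic correspondences (assumption): 40 matches, the first 12 are wrong.
rng = np.random.default_rng(1)
cad_pts = rng.uniform(-0.1, 0.1, size=(40, 3))
a = np.deg2rad(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([0.05, -0.02, 0.60])
scene_pts = cad_pts @ R_true.T + t_true
scene_pts[:12] = rng.uniform(-0.5, 0.5, size=(12, 3))  # simulated outliers

R_est, t_est, inliers = ransac_pose(cad_pts, scene_pts)
```

Three non-collinear correspondences are the minimum needed to fix a rigid transform, which is why each hypothesis is built from a 3-point sample; the refit on the consensus set then averages out noise among the inliers.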