DVMNet++: Rethinking Relative Pose Estimation for Unseen Objects

Chen Zhao, Tong Zhang, Zheng Dang, Mathieu Salzmann

TL;DR

DVMNet++ estimates the relative pose of a previously unseen object between a query image and a reference image, without relying on ground-truth bounding boxes or a large set of rotation hypotheses. The method combines an open-set, text-guided object detector with a deep voxel matching network that lifts RGB images to 3D voxel embeddings and estimates rotation in an end-to-end, differentiable manner. A Weighted Closest Voxel (WCV) algorithm mitigates the impact of noisy voxels by weighting voxel correspondences with objectness and object masks, so that the rotation is computed in a single pass. On the CO3D, Objaverse, LINEMOD, and LINEMOD-O datasets, DVMNet++ delivers more accurate relative pose estimates at a lower computational cost than state-of-the-art methods, and it remains robust under occlusion and with sparse references. Because it depends less on precise detections and dense references, the approach is practical for generalizable 6D pose estimation in cluttered or open-world scenarios.

Abstract

Determining the relative pose of a previously unseen object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically predict 3D translation utilizing the ground-truth object bounding box and approximate 3D rotation with a large number of discrete hypotheses. This strategy makes unrealistic assumptions about the availability of ground truth and incurs a computationally expensive process of scoring each hypothesis at test time. By contrast, we rethink the problem of relative pose estimation for unseen objects by presenting a Deep Voxel Matching Network (DVMNet++). Our method computes the relative object pose in a single pass, eliminating the need for ground-truth object bounding boxes and rotation hypotheses. We achieve open-set object detection by leveraging image feature embedding and natural language understanding as reference. The detection result is then employed to approximate the translation parameters and crop the object from the query image. For rotation estimation, we map the two RGB images, i.e., reference and cropped query, to their respective voxelized 3D representations. The resulting voxels are passed through a rotation estimation module, which aligns the voxels and computes the rotation in an end-to-end fashion by solving a least-squares problem. To enhance robustness, we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D, Objaverse, LINEMOD, and LINEMOD-O datasets, demonstrating that our approach delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at https://github.com/sailor-z/DVMNet/.
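
The abstract describes the rotation as the solution of a least-squares problem in which noisy voxels are down-weighted. As a rough illustration only, the sketch below solves a generic weighted least-squares rotation fit with the SVD-based Kabsch procedure in NumPy. It is not taken from the released code: the function names `weighted_rotation`, `angular_error_deg`, and `rot_z`, the centering step, and the toy data are all assumptions made for this example, and the paper's Weighted Closest Voxel algorithm may differ in its details.

```python
import numpy as np


def weighted_rotation(src, dst, weights):
    """Rotation R minimizing sum_i w_i * ||R @ src_i - dst_i||^2 (weighted Kabsch).

    src, dst: (N, 3) arrays of corresponding 3D points (assumed already matched).
    weights:  (N,) non-negative confidences; a zero weight ignores that pair.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Center both point sets on their weighted centroids (an assumption of this sketch).
    src_c = src - (w[:, None] * src).sum(axis=0)
    dst_c = dst - (w[:, None] * dst).sum(axis=0)

    # Weighted cross-covariance H = sum_i w_i * src_i * dst_i^T.
    H = (w[:, None] * src_c).T @ dst_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that det(R) = +1 (a rotation, not a reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T


def angular_error_deg(R_pred, R_gt):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def rot_z(deg):
    """Rotation about the z-axis, used only to build the toy example."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy check: 20 clean correspondences plus 5 corrupted ones with zero weight.
rng = np.random.default_rng(0)
R_true = rot_z(30.0)
src = rng.normal(size=(25, 3))
dst = src @ R_true.T
dst[20:] = rng.normal(size=(5, 3))  # simulate noisy voxels
weights = np.ones(25)
weights[20:] = 0.0                  # hypothetical weights that reject the outliers

R_est = weighted_rotation(src, dst, weights)
print("angular error (deg):", angular_error_deg(R_est, R_true))
```

In this toy setting the zero weights remove the corrupted pairs entirely, so the fit depends only on the clean correspondences. Every step is built from differentiable matrix operations, which is what makes a closed-form solver of this kind usable inside an end-to-end trained network.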

Paper Structure

This paper contains 21 sections, 17 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Advantages of our DVMNet++ compared to hypothesis-based methods. Hypothesis-based techniques approximate the relative object rotation by scoring numerous rotation hypotheses, leading to a high computational cost. By contrast, our DVMNet++ computes the rotation in a hypothesis-free fashion by robustly matching voxelized 3D representations of the reference and query images via a Weighted Closest Voxel algorithm. Our method strikes a favorable balance between computational cost and accuracy in relative object pose estimation, as measured by multiply-accumulate operations (MACs) and angular error.
  • Figure 2: Problem formulation. (a) Input to our method, consisting of a query image and a reference image. (b) Our goal is to identify the corresponding object in the query image and estimate the object translation and rotation based on the reference image. We represent the predicted translation and rotation as a bounding box and green arrows, respectively.
  • Figure 3: Open-set object detection. We incorporate an open-set object detection module in our relative object pose estimation framework, utilizing multi-modal reference information. Given the reference image, we describe the object appearance using text prompts. An open-vocabulary object detection network takes these prompts and the query image as input, and predicts a set of object proposals. Since the generated proposals may include outliers, we propose identifying the most reliable prediction using an image retrieval technique. We encode the reference image and proposals to feature descriptors by utilizing a pretrained DINOv2 encoder. The final detection result is determined as the proposal with the highest cosine similarity score.
  • Figure 4: Network architecture of our autoencoder. The encoder takes two RGB images, query and reference, as input and lifts their 2D feature embeddings to 3D voxels by leveraging cross-view 3D information. $\mathbf{O}_q$ and $\mathbf{O}_r$ represent the learned 3D objectness maps, which account for robust object rotation estimation. The decoder then reconstructs the masked object images from the voxels, allowing the voxels to encode the object patterns.
  • Figure 5: Computing relative object rotation from 3D voxels. The feature similarities of $\mathbf{V}_q$ and $\mathbf{V}_r$ are computed, which results in a score matrix $\mathbf{S}$. A soft assignment is performed based on $\mathbf{S}$ over the query object mask $\hat{\mathbf{M}}_q$, the 3D objectness map $\mathbf{O}_q$, and the 3D coordinates $\mathbf{X}_q$. The aligned query and reference voxels are then fed into a Weighted Closest Voxel (WCV) algorithm that estimates the relative object rotation in a robust and end-to-end manner.
  • ...and 5 more figures
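
Two steps in the captions above lend themselves to a short illustration: picking the detection proposal with the highest cosine similarity to the reference descriptor (Figure 3), and the soft assignment driven by the score matrix $\mathbf{S}$ (Figure 5). The NumPy sketch below is a simplified reading of those captions, not the authors' implementation. The function names `select_proposal` and `soft_assign`, the `temperature` parameter, the choice to normalize over query voxels, and the toy inputs are assumptions; in the actual pipeline the descriptors would come from a pretrained DINOv2 encoder and the voxel features from the learned encoder.

```python
import numpy as np


def select_proposal(ref_desc, proposal_descs):
    """Return the index of the proposal most similar to the reference, plus all scores.

    ref_desc:       (D,) descriptor of the reference image.
    proposal_descs: (K, D) descriptors of the K detected proposals.
    """
    ref = np.asarray(ref_desc, dtype=float)
    props = np.asarray(proposal_descs, dtype=float)
    ref = ref / np.linalg.norm(ref)
    props = props / np.linalg.norm(props, axis=1, keepdims=True)
    sims = props @ ref  # cosine similarity of each proposal with the reference
    return int(np.argmax(sims)), sims


def soft_assign(feat_q, feat_r, values_q, temperature=0.1):
    """Softly align per-voxel query values (e.g. 3D coordinates) to reference voxels.

    feat_q:   (Nq, C) query voxel features.
    feat_r:   (Nr, C) reference voxel features.
    values_q: (Nq, M) query quantities to transfer (coordinates, objectness, mask).
    Returns the (Nr, M) aligned values and the (Nr, Nq) assignment matrix.
    """
    fq = np.asarray(feat_q, dtype=float)
    fr = np.asarray(feat_r, dtype=float)
    fq = fq / np.linalg.norm(fq, axis=1, keepdims=True)
    fr = fr / np.linalg.norm(fr, axis=1, keepdims=True)

    # Score matrix S: cosine similarity between every reference and query voxel.
    S = (fr @ fq.T) / temperature
    S = S - S.max(axis=1, keepdims=True)  # subtract row maxima for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)  # softmax over query voxels
    return P @ np.asarray(values_q, dtype=float), P


# Toy usage with hand-made descriptors and features.
best, sims = select_proposal([1.0, 0.0, 0.0],
                             [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [-1.0, 0.0, 0.0]])
print("selected proposal:", best)

feat_q = np.eye(3)
feat_r = np.eye(3)[[2, 0, 1]]  # reference voxels are a permutation of the query voxels
coords_q = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
aligned, P = soft_assign(feat_q, feat_r, coords_q, temperature=0.01)
print(np.round(aligned, 3))
```

A low temperature makes each row of the assignment matrix close to one-hot, so the soft assignment approaches a hard nearest-neighbor match while remaining differentiable. The aligned coordinates, together with the transferred objectness and mask values, are the kind of input a weighted rotation solver would then consume.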