AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation

Yang Zou, Zhaoshuai Qi, Yating Liu, Zihao Xu, Weipeng Sun, Weiyi Liu, Xingyuan Li, Jiaqi Yang, Yanning Zhang

TL;DR

AxisPose rethinks 6D pose estimation by eliminating dependence on CAD models, depth, and multi-view references. It uses a diffusion-based Axis Generation Module to learn a latent 2D tri-axis pose representation from a single RGB image, followed by a Triaxial Back-projection Module that recovers the 6D pose from the generated axes. A geometric consistency loss guides the diffusion process: its gradient is injected into the noise estimation to enforce geometric plausibility. Experiments on LINEMOD and YCB-Video show competitive performance among model-free methods, with strong robustness to occlusion and weak texture, highlighting the potential of a generative, matching-free approach for cross-instance pose estimation. The work points to extending the approach to unseen objects and improving cross-instance generalization while keeping the model-free, single-shot paradigm.
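To make the "2D tri-axis" idea concrete, the following is a minimal sketch of the forward direction only: projecting an object's origin and the tips of its three coordinate axes into the image with a standard pinhole camera. It is not the paper's Axis Generation Module or its Triaxial Back-projection Module; the function name `project_tri_axis`, the axis length, and the intrinsics values are assumptions chosen for illustration.

```python
import numpy as np

def project_tri_axis(R, t, K, axis_len=0.1):
    """Project the object origin and its three axis tips into the image.

    R: (3, 3) rotation and t: (3,) translation, mapping object to camera frame.
    K: (3, 3) pinhole intrinsics.
    Returns a (4, 2) array of pixel coordinates: origin, x-tip, y-tip, z-tip.
    Hypothetical helper for illustration, not taken from the paper.
    """
    # Origin plus the tips of the x, y, z axes, expressed in the object frame.
    pts_obj = np.vstack([np.zeros(3), axis_len * np.eye(3)])
    pts_cam = pts_obj @ R.T + t        # rigid transform into the camera frame
    uvw = pts_cam @ K.T                # apply the pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

# Assumed intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Object aligned with the camera, 1 unit in front of it.
axes_2d = project_tri_axis(np.eye(3), np.array([0.0, 0.0, 1.0]), K)
print(axes_2d)
```

In this configuration the z-axis points straight along the optical axis, so its tip projects onto the same pixel as the origin. The example shows why a 2D tri-axis can encode 3D orientation in general, and also why some viewpoints are degenerate.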

Abstract

Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has long been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex input (e.g., multi-view reference input, depth, or CAD models) and an intricate pipeline (e.g., feature extraction, SfM, 2D-to-3D matching, and PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of a geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. From the generated tri-axis projection, AxisPose then uses a Triaxial Back-projection Module (TBM) to recover the 6D pose. AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input without reference images, with great potential for generalization to the unseen-object level.
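The gradient-injection idea can be sketched in the style of classifier guidance: the predicted noise is shifted along the gradient of a loss so that the implied clean sample moves toward lower loss. The sketch below is an assumption-laden toy, not the paper's formulation. The loss (three 2D segments should share a start point), the guidance scale, and the choice to evaluate the loss on the noisy sample are all hypothetical.

```python
import numpy as np

def toy_consistency_loss_grad(x):
    """Toy stand-in for a geometric consistency loss (not the paper's loss).

    x: (3, 2, 2) array of three 2D segments, each stored as (start, end).
    The loss penalizes start points that do not coincide.
    Returns the loss value and its gradient with respect to x.
    """
    starts = x[:, 0, :]
    dev = starts - starts.mean(axis=0)   # deviation from the common origin
    loss = float((dev ** 2).sum())
    grad = np.zeros_like(x)
    grad[:, 0, :] = 2.0 * dev            # analytic gradient; end points are unconstrained
    return loss, grad

def guided_noise(eps_pred, x_t, alpha_bar_t, scale=0.5):
    """Shift the predicted noise along the loss gradient (guidance-style update)."""
    _, grad = toy_consistency_loss_grad(x_t)
    return eps_pred + scale * np.sqrt(1.0 - alpha_bar_t) * grad

def predict_x0(x_t, eps, alpha_bar_t):
    """Standard DDPM estimate of the clean sample from x_t and a noise estimate."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x_t = rng.normal(size=(3, 2, 2))         # a noisy tri-axis sample
eps_pred = np.zeros_like(x_t)            # placeholder for a network's noise prediction
alpha_bar_t = 0.5

loss_plain, _ = toy_consistency_loss_grad(predict_x0(x_t, eps_pred, alpha_bar_t))
eps_guided = guided_noise(eps_pred, x_t, alpha_bar_t)
loss_guided, _ = toy_consistency_loss_grad(predict_x0(x_t, eps_guided, alpha_bar_t))
print(loss_plain, loss_guided)
```

With these settings the guided estimate halves each start point's deviation from the shared origin, so the toy loss drops to a quarter of its unguided value. A real implementation would obtain the gradient by automatic differentiation through whatever consistency loss the model defines.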

Paper Structure

This paper contains 15 sections, 14 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Existing methods rely on 2D-3D matching, either directly from input CAD models (e.g., instance-level methods) or depth data (e.g., category-level methods), or indirectly from multiple supporting views (e.g., unseen-object methods). In contrast, we hypothesize that each object possesses an intrinsic 2D tri-axis pose representation that reflects its 3D characteristics, making feature matching unnecessary. Based on this idea, we propose inferring the 6D pose in a model-free, matching-free, and single-shot manner by learning the tri-axis as a 2D latent pose representation. We provide a visual comparison with two instance-level methods (CheckerPose [lian2023checkerpose], DProST [park2022dprost]) and three unseen-object methods (NOPE [nguyen2024nope], OnePose++ [he2022onepose++] with 8 reference views, and Gen6D [liu2022gen6d] with 50 reference views), all retrained in an instance-level manner for fair comparison. The reprojection errors, measured in pixels, are shown in the top right corner.
  • Figure 2: Overview of AxisPose. Given a single input image, the geometric-consistency-guided Axis Generation Module (AGM) first generates the 2D tri-axis projection. The Triaxial Back-projection Module (TBM) then reconstructs the 6D pose from it.
  • Figure 3: Visualization of instances used for training and testing. The first seven instances are from the LINEMOD dataset [hinterstoisser2012model], and the remaining three instances come from the YCB-Video dataset [xiang2017posecnn].
  • Figure 4: Qualitative results. The green bounding boxes indicate the ground-truth poses, while the red bounding boxes represent the predicted poses. Our method achieves satisfactory performance across various instances and remains robust under weak texture and occlusion.
  • Figure 5: Qualitative ablation of geometric consistency loss.
  • ...and 1 more figure