RDPN6D: Residual-based Dense Point-wise Network for 6Dof Object Pose Estimation Based on RGB-D Images

Zong-Wei Hong, Yen-Yang Hung, Chu-Song Chen

TL;DR

The paper addresses 6DoF object pose estimation from RGB-D images, focusing on robustness under occlusion where sparse keypoint and direct regression methods struggle. It proposes RDPN6D, a residual-based dense point-wise network that uses dense 2D-3D and 3D-3D correspondences, along with an intrinsic-crop adjustment to generate accurate camera xyz maps. By representing surface coordinates with a set of FPS anchors and residuals, RDPN reduces the output space and improves handling of symmetry and clutter, while a two-branch RGB-D fusion and a pose predictor regress the 6D pose from dense correspondences. On MP6D, YCB-Video, LineMOD, and Occlusion LineMOD, RDPN achieves state-of-the-art results, particularly under heavy occlusion, with an efficient runtime suitable for near real-time applications. The authors provide code at the project URL to facilitate adoption and benchmarking.
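
The anchor-plus-residual idea can be illustrated with a minimal sketch. It assumes anchors are chosen by greedy farthest point sampling over the model points, and that each surface point is stored as the index of its nearest anchor plus the offset from that anchor. The function names `farthest_point_sampling`, `encode_point`, and `decode_point` are hypothetical, and any normalization or loss weighting used in the actual RDPN6D code is not reproduced here.

```python
import math


def farthest_point_sampling(points, k):
    """Greedy FPS: start from points[0], then repeatedly pick the point
    that is farthest from the anchors chosen so far."""
    chosen = [0]
    # Distance from every point to its closest chosen anchor.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return [points[i] for i in chosen]


def encode_point(point, anchors):
    """Return (index of nearest anchor, residual vector to that anchor)."""
    idx = min(range(len(anchors)), key=lambda i: math.dist(point, anchors[i]))
    residual = tuple(p - a for p, a in zip(point, anchors[idx]))
    return idx, residual


def decode_point(idx, residual, anchors):
    """Recover the 3D object coordinate from an anchor index and a residual."""
    return tuple(a + r for a, r in zip(anchors[idx], residual))


if __name__ == "__main__":
    # Toy "model": four points along the x axis (illustrative values only).
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.9, 0.0, 0.0)]
    anchors = farthest_point_sampling(model, k=2)
    idx, residual = encode_point((0.9, 0.0, 0.0), anchors)
    print(anchors, idx, residual, decode_point(idx, residual, anchors))
```

In this encoding the residual is never longer than the distance to the nearest anchor, so its range shrinks as more anchors are used. That is the sense in which the output space is reduced.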

Abstract

In this work, we introduce a novel method for calculating the 6DoF pose of an object using a single RGB-D image. Unlike existing methods that either directly predict objects' poses or rely on sparse keypoints for pose recovery, our approach addresses this challenging task using dense correspondence, i.e., we regress the object coordinates for each visible pixel. Our method leverages existing object detection methods. We incorporate a re-projection mechanism to adjust the camera's intrinsic matrix to accommodate cropping in RGB-D images. Moreover, we transform the 3D object coordinates into a residual representation, which can effectively reduce the output space and yield superior performance. We conducted extensive experiments to validate the efficacy of our approach for 6D pose estimation. Our approach outperforms most previous methods, especially in occlusion scenarios, and demonstrates notable improvements over the state-of-the-art methods. Our code is available on https://github.com/AI-Application-and-Integration-Lab/RDPN6D.
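
The intrinsic adjustment for cropped RGB-D inputs can be sketched with the standard pinhole crop-and-resize rule. This is an assumption about the general technique, not a transcription of the paper's equations: the function names, the crop parameters (`x0`, `y0`, `crop_w`, `crop_h`), and the numeric values are all hypothetical, and half-pixel offset conventions are ignored.

```python
def crop_intrinsics(fx, fy, cx, cy, x0, y0, crop_w, crop_h, out_w, out_h):
    """Intrinsics of a view obtained by cropping the window with top-left
    corner (x0, y0) and size (crop_w, crop_h), then resizing it to
    (out_w, out_h)."""
    sx = out_w / crop_w
    sy = out_h / crop_h
    # Shift the principal point into the crop frame, then scale everything.
    return fx * sx, fy * sy, (cx - x0) * sx, (cy - y0) * sy


def project(x, y, z, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    return fx * x / z + cx, fy * y / z + cy


def backproject(u, v, z, fx, fy, cx, cy):
    """Lift a pixel with depth z back to a camera-frame 3D point."""
    return (u - cx) * z / fx, (v - cy) * z / fy, z


if __name__ == "__main__":
    k_org = (600.0, 600.0, 320.0, 240.0)       # illustrative intrinsics
    point = (0.1, -0.05, 0.8)                  # camera-frame point, meters
    u, v = project(*point, *k_org)

    x0, y0, crop, out = 300.0, 150.0, 128.0, 256.0
    k_crop = crop_intrinsics(*k_org, x0, y0, crop, crop, out, out)

    # The same pixel expressed in the cropped-and-resized image.
    u_c, v_c = (u - x0) * out / crop, (v - y0) * out / crop
    print(backproject(u_c, v_c, point[2], *k_crop))
```

Back-projecting the cropped pixel with the adjusted intrinsics recovers the original camera-frame point, whereas reusing the original intrinsics on the cropped image would not. This is why the camera xyz map needs the adjusted matrix.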

Paper Structure (17 sections, 9 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Overview of our approach. Our method predicts the 3D coordinates of each pixel on the object's surface, resulting in pixel-wise (or dense) correspondence. The object pose is then estimated based on the pixel-wise correspondence.
  • Figure 2: Our proposed residual representation. We map each point on the object's surface using a set of spatially distributed anchor points and a fine-level residual vector (to the nearest anchor). This removes the need for the network to directly predict exact coordinates over a very large range, and makes the correspondence prediction more robust.
  • Figure 3: Framework of RDPN. (i) Starting with an RGB-D image, we first use the object detection results to crop the region of interest (ROI), which yields a zoomed-in view ($\mathcal{I}_{rgb}, \mathcal{I}_{depth}$). To obtain an accurate projected camera xyz map $\mathcal{I}_{C_{xyz}}$, the original camera intrinsic matrix $\textbf{K}_{org}$ must be adjusted to $\textbf{K}_{crop}$. (ii) Given $\mathcal{I}_{rgb}$ and $\mathcal{I}_{C_{xyz}}$, the RGB-D feature extractor captures the RGB-D fusion features $\mathcal{F}_{rgbd}$ and feeds them into a feature decoder to obtain both the mask ($\mathcal{F}_{mask}$) and the per-pixel prediction of the point coordinates in the 3D model of the object. This prediction includes a ($\textit{K}+1$)-dimensional region probability ($\mathcal{F}_{region}$), the corresponding 3-dimensional nearest anchors, and the residual vector ($\mathcal{F}_{residual}$). (iii) Finally, based on the mask and object coordinates, we use an image uv map ($\mathcal{I}_{UV}$) and a downsampled camera xyz map ($\mathcal{I}_{C_{xyz64}}$) to establish dense correspondences. These correspondences are then fed into the pose predictor to regress the object pose R and t.
  • Figure 4: Qualitative results on Occlusion LineMOD. The images are rendered by projecting the 3D object model onto the image plane using the estimated pose. Our two-step dense correspondence method can accurately capture the object and predict its pose, even under heavy occlusion. This contrasts with previous keypoint-based methods, which often struggle in such scenarios.
  • Figure 5: Ablation study on the number of anchors. The 10$^{\circ}$, 10 cm metric checks whether the rotation error and the translation error are less than 10$^{\circ}$ and 10 cm, respectively.
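
The 10°, 10 cm criterion named in the Figure 5 caption can be sketched as follows. This assumes the usual definitions: rotation error is the geodesic angle between the predicted and ground-truth rotation matrices, and translation error is the Euclidean distance between the translation vectors. The helper names and the choice of meters for translations are assumptions, and no special handling of symmetric objects is included.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_gt^T R_pred) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the value outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean translation error in centimeters (inputs assumed in meters)."""
    return 100.0 * math.dist(t_pred, t_gt)


def within_10deg_10cm(r_pred, t_pred, r_gt, t_gt, max_deg=10.0, max_cm=10.0):
    """True when both errors are below their thresholds."""
    return (rotation_error_deg(r_pred, r_gt) < max_deg
            and translation_error_cm(t_pred, t_gt) < max_cm)


def rot_z(deg):
    """Rotation about the z axis, used only to build test poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    r_gt, t_gt = rot_z(0.0), (0.0, 0.0, 1.0)
    print(within_10deg_10cm(rot_z(5.0), (0.05, 0.0, 1.0), r_gt, t_gt))
    print(within_10deg_10cm(rot_z(15.0), (0.05, 0.0, 1.0), r_gt, t_gt))
```

Averaging this boolean over all test instances gives an accuracy figure under this criterion.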