W-PoseNet: Dense Correspondence Regularized Pixel Pair Pose Regression

Zelin Xu, Ke Chen, Kui Jia

TL;DR

W-PoseNet is a pose estimation algorithm that densely regresses from input data to both the 6D pose and 3D coordinates in model space. A sparse pair combination of pixel-wise features and soft voting on pixel-pair pose predictions improve robustness to inconsistent and sparse local features.

Abstract

6D pose estimation is non-trivial because it must cope with intrinsic appearance and shape variation and severe inter-object occlusion, and it becomes more challenging under large extrinsic illumination changes and low-quality data acquired in uncontrolled environments. This paper introduces a novel pose estimation algorithm, W-PoseNet, which densely regresses from input data to both the 6D pose and 3D coordinates in model space. In other words, the local features learned for pose regression in our deep network are regularized by explicitly learning a pixel-wise correspondence mapping onto pose-sensitive 3D coordinates as an auxiliary task. Moreover, a sparse pair combination of pixel-wise features and soft voting on pixel-pair pose predictions are designed to improve robustness to inconsistent and sparse local features. Experimental results on the popular YCB-Video and LineMOD benchmarks show that the proposed W-PoseNet consistently outperforms state-of-the-art algorithms.
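
The abstract does not spell out how soft voting combines the pixel-pair predictions. As a rough illustration only, the sketch below assumes each pixel pair yields a translation estimate and a scalar confidence, and fuses them with a softmax-weighted average. The function name `soft_vote_translation`, the softmax weighting, and the restriction to translation are all assumptions made for this example, not the paper's formulation; rotations would need a proper rotation-averaging scheme and are left out.

```python
import math

def soft_vote_translation(translations, confidences):
    """Softmax-weighted average of per-pair translation predictions.

    Hypothetical illustration: the weighting scheme is an assumption,
    not taken from the paper.
    translations: list of [x, y, z] estimates, one per pixel pair
    confidences:  list of scalar scores, one per pixel pair
    """
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(confidences)
    weights = [math.exp(c - top) for c in confidences]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Every prediction contributes in proportion to its weight.
    return [
        sum(w * t[axis] for w, t in zip(weights, translations))
        for axis in range(3)
    ]

# Two equally confident pairs are simply averaged.
fused = soft_vote_translation([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]], [1.0, 1.0])
```

Unlike picking the single most confident prediction, a weighted vote of this kind lets several moderately confident pairs jointly outweigh one outlier.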

Paper Structure

This paper contains 11 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Visualization of our $\mathcal{W}$-PoseNet in comparison with its competitor DenseFusion wang2019densefusion. The key differences are that our $\mathcal{W}$-PoseNet introduces pixel pair pose estimation based on low-rank bilinear pooling (LRBP, highlighted in a green block) and dense correspondence mapping (DCM) from each pixel to its 3D coordinate (highlighted in yellow blocks). The resulting geometric structure resembles the letter $\mathcal{W}$ highlighted in orange. Illustrative examples in the bottom rows are from the YCB-Video dataset xiang2017posecnn.
  • Figure 2: Pipeline of the proposed $\mathcal{W}$-PoseNet. The method first detects and segments the foreground containing object instances in RGB images. RGB and depth images are fed into separate feature encoders and then fused with PointNet++ qi2017pointnet++. Pixel-wise features are sparsely sampled and combined to generate pixel pair features, which produce the 6D pose $[\bm{R}|\bm{t}]$. The Dense Correspondence Mapping branch regresses pose-specific 3D coordinates from per-pixel features, providing additional geometric constraints for feature learning. A joint loss on the dense correspondence mapping and pixel-pair pose regression branches is used to supervise network training.
  • Figure 3: Dense Correspondence Mapping (DCM). This module aims to regularize feature learning in pose regression by mapping onto 3D coordinates that are sensitive to 6D poses. Specifically, the point cloud in blue is generated by transforming the points sampled from the depth image into the model coordinate system; the transformed points are used as supervision signals in DCM.
  • Figure 4: Comparison of our $\mathcal{W}$-PoseNet and two state-of-the-art methods under different degrees of occlusion.
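
The DCM supervision described in the Figure 3 caption, where points sampled from the depth image are transformed into the model coordinate system, can be sketched as follows. This is a minimal illustration assuming the common convention that the ground-truth pose maps model points to camera points, $\bm{p}_{cam} = \bm{R}\,\bm{p}_{model} + \bm{t}$, so the targets are $\bm{p}_{model} = \bm{R}^{\top}(\bm{p}_{cam} - \bm{t})$. The function names and the mean Euclidean distance used as a loss are hypothetical choices for this example, not the paper's stated definitions.

```python
import math

def camera_to_model(points_cam, R, t):
    """Map camera-frame points into the model frame: p_model = R^T (p_cam - t).

    Assumes the pose convention p_cam = R p_model + t (an assumption here).
    points_cam: list of [x, y, z]; R: 3x3 rotation as nested lists; t: [x, y, z]
    """
    R_T = [list(col) for col in zip(*R)]  # transpose equals inverse for a rotation
    targets = []
    for p in points_cam:
        shifted = [p[i] - t[i] for i in range(3)]
        targets.append([sum(r * s for r, s in zip(row, shifted)) for row in R_T])
    return targets

def mean_distance(pred, target):
    """Hypothetical DCM loss: mean Euclidean distance between predicted and
    target model-space coordinates."""
    dists = [math.dist(p, q) for p, q in zip(pred, target)]
    return sum(dists) / len(dists)

# A 90-degree rotation about the z-axis plus a translation.
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t = [1, 2, 3]
# The model point [1, 0, 0] is observed at [1, 3, 3] in the camera frame.
targets = camera_to_model([[1, 3, 3]], R, t)
```

Because the targets depend on the ground-truth pose, regressing them per pixel forces the shared features to encode pose-sensitive geometry, which is the regularizing effect the caption describes.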