MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation

Yuelong Li, Yafei Mao, Raja Bala, Sunil Hadap

TL;DR

A single-shot approach to determining the 6-DoF pose of an object with an available 3D computer-aided design (CAD) model from a single RGB image, outperforming all competing RGB-based methods on four challenging BOP benchmark datasets.

Abstract

We propose a single-shot approach to determining the 6-DoF pose of an object with an available 3D computer-aided design (CAD) model from a single RGB image. Our method, dubbed MRC-Net, comprises two stages. The first stage performs pose classification and renders the 3D object in the classified pose. The second stage performs regression to predict the fine-grained residual pose within the class. Connecting the two stages is a novel multi-scale residual correlation (MRC) layer that captures high- and low-level correspondences between the input image and the rendering from the first stage. MRC-Net employs a Siamese network with shared weights between both stages to learn embeddings for the input and rendered images. To mitigate ambiguity when predicting discrete pose class labels on symmetric objects, we use soft probabilistic labels to define the pose class in the first stage. We demonstrate state-of-the-art accuracy, outperforming all competing RGB-based methods on four challenging BOP benchmark datasets: T-LESS, LM-O, YCB-V, and ITODD. Our method is non-iterative and requires no complex post-processing.
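
The abstract does not say how the soft probabilistic labels are built, so the following is only a minimal sketch of one plausible construction, not the authors' formulation. It assumes a finite set of class rotations, a known list of symmetry rotations for the object, and a softmax over the negative geodesic distance to the nearest symmetry-equivalent ground-truth pose. The names `soft_pose_labels`, `symmetries`, and `temperature` are hypothetical.

```python
import numpy as np

def geodesic_angle(Ra, Rb):
    """Rotation angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def soft_pose_labels(class_rots, gt_rot, symmetries, temperature=0.2):
    """Hypothetical soft labels over discrete pose classes.

    Each class is scored by its distance to the closest
    symmetry-equivalent version of the ground-truth rotation, so
    visually identical poses receive the same probability mass.
    """
    dists = np.array([
        min(geodesic_angle(Rc, gt_rot @ S) for S in symmetries)
        for Rc in class_rots
    ])
    logits = -dists / temperature      # assumed softmax temperature
    logits -= logits.max()             # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy setup: four classes spaced 90 degrees apart about z, and an
# object that looks identical after a 180 degree turn about z.
class_rots = [rot_z(k * np.pi / 2) for k in range(4)]
symmetries = [rot_z(0.0), rot_z(np.pi)]
labels = soft_pose_labels(class_rots, rot_z(0.0), symmetries)
```

In this toy case the classes at 0 and 180 degrees receive equal probability, so a classifier trained on such labels is not penalized for choosing either of two indistinguishable poses.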

Paper Structure

This paper contains 10 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: MRC-Net features a single-shot sequential Siamese structure of two stages, where the second stage conditions on the first classification stage outcome through multi-scale residual correlation of poses between input and rendered images.
  • Figure 2: MRC-Net architecture. The classifier and regressor stages employ a Siamese structure with shared weights. Both stages take the object crop and its bounding box map as input, and extract image features to detect the visible object mask, which are concatenated together to estimate object pose. The classifier first predicts pose labels. These predictions, along with the 3D CAD model, are then used to render an image estimate, which serves as input for the second stage. Features from the rendered image are correlated with those from the real image in the MRC layer. These correlation features undergo atrous spatial pyramid pooling (ASPP) within the rendered branch to regress the pose residuals.
  • Figure 3: Comparison of (a) conventional approaches [lipson2022coupled] and (b) our proposed approach leveraging feature correlations to estimate residual pose. Instead of predicting an intermediate flow field, we directly feed multi-scale feature correlations into the regression head in an end-to-end fashion. These features are discriminative and eliminate the need for post-refinement, outlier removal, or multi-view renderings [lipson2022coupled, hai2023shape, hu2022perspective].
  • Figure 4: Qualitative comparison of results on T-LESS: (a) original RGB image, (b) MRC-Net, (c) CosyPose-initialized PFA [hu2022perspective], (d) SC6D [cai2022sc6d], and (e) ZebraPose [su2022zebrapose]. The object's 3D model is projected with the estimated 6D pose and overlaid on the original images in distinct colors. Red boxes denote cases where pose predictions differ markedly across methods. MRC-Net outperforms the state-of-the-art models, particularly under heavy occlusion. (Best viewed when zoomed in.)
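
The captions for Figures 2 and 3 describe correlating real and rendered features at several scales and passing the result straight to the regression head. Below is a minimal NumPy sketch of that idea, not the paper's MRC layer. The displacement window `radius`, the division by channel count, and the two-level toy pyramid are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def local_correlation(feat_real, feat_rend, radius=1):
    """Correlate two (C, H, W) feature maps over a local window.

    Returns a ((2*radius+1)**2, H, W) volume: one channel per
    displacement of the rendered features relative to the real ones.
    """
    C, H, W = feat_real.shape
    # Zero-pad the spatial dimensions so every shift stays in bounds.
    padded = np.pad(feat_rend, ((0, 0), (radius, radius), (radius, radius)))
    channels = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, dy:dy + H, dx:dx + W]
            # Channel-wise dot product, scaled by C (assumed normalization).
            channels.append((feat_real * shifted).sum(axis=0) / C)
    return np.stack(channels)

def multiscale_correlation(real_pyramid, rend_pyramid, radius=1):
    """Apply local_correlation independently at each pyramid level."""
    return [local_correlation(fr, fd, radius)
            for fr, fd in zip(real_pyramid, rend_pyramid)]

# Toy pyramid: a fine level (8 channels, 16x16) and a coarse level
# (16 channels, 8x8). The rendered features copy the real ones here.
rng = np.random.default_rng(0)
real_pyramid = [rng.standard_normal((8, 16, 16)),
                rng.standard_normal((16, 8, 8))]
rend_pyramid = [f.copy() for f in real_pyramid]
corr = multiscale_correlation(real_pyramid, rend_pyramid)
```

Each level yields a correlation volume at its own resolution. In a setup like the one the captions describe, these volumes would be consumed directly by the regression head instead of being converted to an intermediate flow field.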