Extending 6D Object Pose Estimators for Stereo Vision

Thomas Pöllabauer, Jan Emrich, Volker Knauthe, Arjan Kuijper

TL;DR

This work introduces a BOP-compatible stereo version of the YCB-V dataset and extends dense feature-based 6D pose estimators to stereo input. The resulting method outperforms state-of-the-art 6D pose estimation algorithms and can be adopted for other dense feature-based algorithms.

Abstract

Estimating the 6D pose of objects accurately, quickly, and robustly remains a difficult task. However, recent methods for directly regressing poses from RGB images using dense features have achieved state-of-the-art results. Stereo vision, which provides an additional perspective on the object, can help reduce pose ambiguity and the effects of occlusion. Moreover, stereo can directly infer the distance of an object, while mono-vision requires internalized knowledge of the object's size. To extend the state of the art in 6D object pose estimation to stereo, we created a BOP-compatible stereo version of the YCB-V dataset. Our method outperforms state-of-the-art 6D pose estimation algorithms by utilizing stereo vision and can easily be adopted for other dense feature-based algorithms.
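
The claim that stereo can infer distance directly can be illustrated with the textbook relation for a rectified stereo pair, $Z = f \cdot B / d$, where $f$ is the focal length in pixels, $B$ the baseline between the cameras, and $d$ the disparity in pixels. The sketch below is only that geometric relation, not the method of the paper, which regresses poses with a learned network. The function name and the numeric values are hypothetical.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d.

    Assumes ideal rectified pinhole cameras; all names here are illustrative.
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is invalid.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical numbers: 1000 px focal length, 10 cm baseline, 50 px disparity.
depth_m = depth_from_disparity(focal_px=1000.0, baseline_m=0.1, disparity_px=50.0)
print(f"{depth_m:.2f} m")
```

A single camera has no such relation available: the same image can come from a small, near object or a large, distant one, which is why a mono method has to rely on knowledge of the object's size.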

Paper Structure (18 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Details on some proposed architectures for feature fusion at different pipeline stages. Exemplified on the SO-Pose algorithm, but also applicable to GDRN. a) Early fusion. Feature maps are merged after individual forward propagation through the backbone. b) Mid-level fusion. After two separate forward passes the embeddings are concatenated and additional convolutional layers are added inside the $PnP$ Net to cope with the additional dimensionality. c) Late fusion. Features are fused only after the CNN inside the $PnP$ network but before the multi-layer perceptron / dense network. d) Double fusion. Here we re-use the late fusion $PnP$ Net and combine it with an embedding mixing scheme where half of the feature maps are swapped between left and right. This way each individual forward pass has information from both images. e) Early + shared backbone based disparity prediction. We add a disparity prediction using the feature maps as extracted from our common backbone. Disparity prediction is done both ways in a symmetric fashion. The additional features are again concatenated and fed to the $PnP$ network for pose regression.
  • Figure 2: Physically-Based-Rendered Dataset Examples
  • Figure 3: Using our setup of two Azure Kinect cameras, we acquire high-resolution color images together with depth information. From left to right: our camera setup, a frame from the first camera, a frame from the second camera, and one of the corresponding depth images.
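
Two of the fusion ideas in the Figure 1 caption, merging the left and right feature maps and swapping half of the feature maps between the two views, can be sketched on toy data. This is a minimal illustration, not the authors' implementation. Plain Python lists of channel labels stand in for tensors, the function names are hypothetical, concatenation is assumed as the merge operator, and exchanging the second half of the channels is an assumption, since the caption does not say which half is swapped.

```python
from typing import List, Tuple

# A feature map is modelled as a list of channels. Strings stand in for the
# 2D activation grids a real backbone would produce.
FeatureMap = List[str]


def concat_fusion(left: FeatureMap, right: FeatureMap) -> FeatureMap:
    """Merge both views by channel-wise concatenation (assumed merge operator)."""
    return list(left) + list(right)


def swap_half_channels(left: FeatureMap, right: FeatureMap) -> Tuple[FeatureMap, FeatureMap]:
    """Exchange half of the channels between the two views.

    Assumption: the second half is exchanged. After the swap, each view
    carries channels from both images.
    """
    if len(left) != len(right):
        raise ValueError("both views must have the same number of channels")
    half = len(left) // 2
    mixed_left = list(left[:half]) + list(right[half:])
    mixed_right = list(right[:half]) + list(left[half:])
    return mixed_left, mixed_right


left = ["L0", "L1", "L2", "L3"]
right = ["R0", "R1", "R2", "R3"]

fused = concat_fusion(left, right)
mixed_left, mixed_right = swap_half_channels(left, right)

print(fused)
print(mixed_left)
print(mixed_right)
```

Concatenation doubles the channel count, which matches the caption's remark that additional layers are needed to cope with the added dimensionality. The swap keeps the channel count per view unchanged, so each forward pass can reuse the same downstream network.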