Extending 6D Object Pose Estimators for Stereo Vision

Thomas Pöllabauer; Jan Emrich; Volker Knauthe; Arjan Kuijper

TL;DR

This work introduces a BOP-compatible stereo version of the YCB-V dataset and a stereo extension of dense-feature 6D pose estimators. The method outperforms state-of-the-art 6D pose estimation algorithms and can be adopted for other dense feature-based algorithms.

Abstract

Estimating the 6D pose of objects accurately, quickly, and robustly remains a difficult task. However, recent methods for directly regressing poses from RGB images using dense features have achieved state-of-the-art results. Stereo vision, which provides an additional perspective on the object, can help reduce pose ambiguity and mitigate occlusion. Moreover, stereo can directly infer the distance of an object, whereas monocular vision requires internalized knowledge of the object's size. To extend the state of the art in 6D object pose estimation to stereo, we created a BOP-compatible stereo version of the YCB-V dataset. Our method outperforms state-of-the-art 6D pose estimation algorithms by utilizing stereo vision and can easily be adopted for other dense feature-based algorithms.
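
The claim that stereo can infer distance directly can be illustrated with the standard relation for a rectified stereo pair, Z = f * B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the disparity in pixels. The snippet below is a minimal sketch of that relation only, not part of the paper's pipeline; the function name and the example values are assumptions chosen for illustration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point seen in a rectified stereo pair.

    Uses the pinhole relation Z = f * B / d. All names and values
    here are illustrative assumptions, not taken from the paper.
    """
    if disparity_px <= 0:
        # Zero or negative disparity has no finite positive depth.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 1000 px focal length, 0.25 m baseline.
depth_m = depth_from_disparity(disparity_px=50.0, focal_px=1000.0, baseline_m=0.25)
print(depth_m)
```

A monocular estimator has no such geometric cue and must rely on what it has learned about the object's size, which is the contrast the abstract draws.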

Paper Structure (18 sections, 3 figures, 1 table)

This paper contains 18 sections, 3 figures, 1 table.

Introduction
Related Work
Keypoint Methods
Iterative Pose Refinement
End-to-end dense Feature 6DOPE
Disparity and Stereo
BOP Challenge
Approach
Early Fusion
Mid-Level Fusion
Late Fusion
Double Fusion
Early Fusion with shared Backbone Disparity Prediction
Evaluation
Synthetic and real Dataset
...and 3 more sections

Figures (3)

Figure 1: Details on some proposed architectures for feature fusion at different pipeline stages. Exemplified on the SO-Pose algorithm, but also applicable to GDRN. a) Early fusion. Feature maps are merged after individual forward propagation through the backbone. b) Mid-level fusion. After two separate forward passes the embeddings are concatenated and additional convolutional layers are added inside the $PnP$ Net to cope with the additional dimensionality. c) Late fusion. Features are fused only after the CNN inside the $PnP$ network but before the multi-layer perceptron / dense network. d) Double fusion. Here we re-use the late fusion $PnP$ Net and combine it with an embedding mixing scheme where half of the feature maps are swapped between left and right. This way each individual forward pass has information from both images. e) Early + shared backbone based disparity prediction. We add a disparity prediction using the feature maps as extracted from our common backbone. Disparity prediction is done both ways in a symmetric fashion. The additional features are again concatenated and fed to the $PnP$ network for pose regression.
Figure 2: Physically-Based-Rendered Dataset Examples
Figure 3: Using our setup of two Azure Kinect cameras, we acquire high-resolution color images together with depth information. From left to right: our camera setup, a frame from the first camera, a frame from the second camera, and one of the corresponding depth images.
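
Two of the fusion operations named in the Figure 1 caption, concatenating left and right embeddings and swapping half of the feature maps between the two views, can be sketched on plain arrays. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the (channels, height, width) layout, and the choice to swap the second half of the channels are all hypothetical.

```python
import numpy as np


def concat_fusion(left, right):
    """Concatenate left/right feature maps along the channel axis.

    Assumes a (channels, height, width) layout.
    """
    return np.concatenate([left, right], axis=0)


def swap_half_channels(left, right):
    """Exchange half of the feature maps between the two views.

    Which half is exchanged is an assumption; here it is the second half.
    """
    half = left.shape[0] // 2
    mixed_left = np.concatenate([left[:half], right[half:]], axis=0)
    mixed_right = np.concatenate([right[:half], left[half:]], axis=0)
    return mixed_left, mixed_right


# Toy feature maps: the left view is all zeros, the right view all ones.
left = np.zeros((4, 2, 2))
right = np.ones((4, 2, 2))

fused = concat_fusion(left, right)
mixed_left, mixed_right = swap_half_channels(left, right)
print(fused.shape, mixed_left.sum(axis=(1, 2)), mixed_right.sum(axis=(1, 2)))
```

After the swap, each mixed tensor carries channels from both views, which matches the caption's point that each forward pass then has information from both images.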
