KVN: Keypoints Voting Network with Differentiable RANSAC for Stereo Pose Estimation

Ivano Donadi, Alberto Pretto

TL;DR

This letter addresses the stereo image-based object pose estimation problem by introducing a differentiable RANSAC layer into a well-known monocular pose estimation network and exploiting an uncertainty-driven multi-view PnP solver which can fuse information from multiple views.

Abstract

Object pose estimation is a fundamental computer vision task exploited in several robotics and augmented reality applications. Many established approaches rely on predicting 2D-3D keypoint correspondences using RANSAC (Random Sample Consensus) and estimating the object pose using the PnP (Perspective-n-Point) algorithm. Because RANSAC is non-differentiable, correspondences cannot be directly learned in an end-to-end fashion. In this paper, we address the stereo image-based object pose estimation problem by i) introducing a differentiable RANSAC layer into a well-known monocular pose estimation network; ii) exploiting an uncertainty-driven multi-view PnP solver that can fuse information from multiple views. We evaluate our approach on a challenging public stereo object pose estimation dataset and on a custom-built dataset we call the Transparent Tableware Dataset (TTD), achieving state-of-the-art results against other recent approaches. Furthermore, our ablation study shows that the differentiable RANSAC layer plays a significant role in the accuracy of the proposed method. We release the code of our method and the TTD dataset with this paper.

Paper Structure (23 sections, 11 equations, 5 figures, 3 tables)

This paper contains 23 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Predicted (blue) vs. ground truth (green) poses for KVN on 4 objects from the TOD dataset (a-d) and 2 objects from the TTD dataset (e,f). Two of the TOD objects are symmetric, a small bottle and an upside-down cup; the other two are a mug and a Christmas tree. The TTD panels (e,f) show a coffee cup and a candle holder. KVN can provide correct estimates even when the object is barely distinguishable from the background, as with the upside-down cup, or occluded, as in (e,f).
  • Figure 2: Overview of the KVN pipeline: i) the input stereo pair is processed by a shared-weights PVNet architecture to obtain an object segmentation mask and pixel-wise vectors pointing to each keypoint's projection; ii) for each keypoint, a set of $N_h$ hypotheses is obtained via minimal set sampling (here $N_h=3$), then each hypothesis is weighted by all vectors using our sub-differentiable scoring function; iii) for each keypoint, we compute a probability distribution over all valid hypotheses; iv) from each distribution, a keypoint is estimated along with its uncertainty; the keypoint positions and the object mask are optimized during training; v) an uncertainty-driven multi-view PnP module (UM-PnP) estimates the object pose by minimizing stereo reprojection errors weighted by the inverse keypoint covariance.
  • Figure 3: (Top row) The heart$_0$ object from the TOD dataset over two differently textured mats; (Bottom row) Two images from the TTD dataset.
  • Figure 4: Four scoring functions compared in the evaluation of differentiable RANSAC. The x-axis represents the cosine similarity between estimated and ground truth unit vectors.
  • Figure 5: Qualitative results of our method. (First row) Left input image; (Second row) Predicted segmentation mask; (Third row) Predicted (diamond) vs. ground truth (circle) keypoints + hypotheses variance; (Fourth row) Predicted (blue) vs. ground truth (green) 3D bounding boxes. In Fig. \ref{fig:qr:8} and Fig. \ref{fig:qr:4}, we report an incorrect estimation in the left column and a correct one in the right column. Fig. \ref{fig:qr:8} displays the most common source of wrong pose estimation for our method, which is poor object segmentation. In particular, texture 8 has proven to be the hardest in the dataset, due to its peculiar pattern. Similarly, in Fig. \ref{fig:qr:4} the object is badly segmented due to a high level of motion blur in the image, causing high variance in keypoints close to the bottom of the bottle and, consequently, an inaccurate object scale. Finally, Fig. \ref{fig:qr:2} shows how choosing a canonical pose allows the network to learn accurate keypoint positions for symmetric objects, with no rotation ambiguity.
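
Steps ii) to iv) of the pipeline in Figure 2 can be illustrated with a minimal sketch of the forward computation for one keypoint. This is not the authors' implementation. The function name `vote_keypoint`, the sigmoid soft-inlier score on cosine similarity, the parameters `cos_thresh` and `sharpness`, and the normalization of scores into hypothesis weights are all assumptions chosen for illustration. NumPy is used only to show the arithmetic; during training the same operations would have to be expressed in an automatic differentiation framework.

```python
import numpy as np


def vote_keypoint(pixels, directions, n_hyp=32, cos_thresh=0.99,
                  sharpness=50.0, seed=0):
    """Soft RANSAC-style voting for a single keypoint (illustrative only).

    pixels:     (N, 2) coordinates of pixels inside the object mask.
    directions: (N, 2) unit vectors predicted at those pixels, each
                pointing toward the keypoint's 2D projection.
    Returns the weighted mean (2,), the covariance (2, 2) and the
    hypothesis weights (H,).
    """
    rng = np.random.default_rng(seed)
    n = len(pixels)

    # Minimal-set sampling: two pixels define two rays, and their
    # intersection is one keypoint hypothesis.
    hyps = []
    for _ in range(20 * n_hyp):            # bounded number of attempts
        if len(hyps) == n_hyp:
            break
        i, j = rng.choice(n, size=2, replace=False)
        # Solve p_i + a * d_i = p_j + b * d_j for (a, b).
        A = np.stack([directions[i], -directions[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:   # near-parallel rays, skip
            continue
        a, _b = np.linalg.solve(A, pixels[j] - pixels[i])
        hyps.append(pixels[i] + a * directions[i])
    if not hyps:
        raise ValueError("no valid hypothesis could be sampled")
    hyps = np.asarray(hyps)                # (H, 2)

    # Cosine similarity between each predicted vector and the unit
    # vector from its pixel to each hypothesis: shape (H, N).
    diff = hyps[:, None, :] - pixels[None, :, :]
    unit = diff / (np.linalg.norm(diff, axis=2, keepdims=True) + 1e-9)
    cos = np.einsum("hnk,nk->hn", unit, directions)

    # ASSUMPTION: a sigmoid replaces the hard inlier threshold so that
    # the score varies smoothly with the predicted vectors.
    soft_inliers = 1.0 / (1.0 + np.exp(-sharpness * (cos - cos_thresh)))
    scores = soft_inliers.sum(axis=1)

    # ASSUMPTION: scores are normalized into a probability distribution.
    weights = scores / scores.sum()

    # Keypoint estimate and its uncertainty from the distribution.
    mean = weights @ hyps
    centered = hyps - mean
    cov = (weights[:, None] * centered).T @ centered
    return mean, cov, weights


# Usage on synthetic data: every pixel points exactly at (50, 40).
kp = np.array([50.0, 40.0])
pixels = np.random.default_rng(1).uniform(0.0, 30.0, size=(200, 2))
directions = (kp - pixels) / np.linalg.norm(kp - pixels, axis=1, keepdims=True)
mean, cov, weights = vote_keypoint(pixels, directions)
print(mean, np.trace(cov))
```

With noise-free vectors all hypotheses coincide, so the covariance collapses toward zero. With noisy vectors the covariance grows, which is the kind of per-keypoint uncertainty that a covariance-weighted PnP step such as the UM-PnP module in step v) could consume.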