Conditional Variational Autoencoders for Probabilistic Pose Regression

Fereidoon Zangeneh; Leonard Bruns; Amit Dekel; Alessandro Pieropan; Patric Jensfelt

TL;DR

This work proposes a probabilistic method that predicts the posterior distribution of camera poses given an observed image. Training yields a generative model of camera poses conditioned on the image, from which samples of the pose posterior can be drawn.

Abstract

Robots rely on visual relocalization to estimate their pose from camera images when they lose track. One of the challenges in visual relocalization is repetitive structures in the robot's operating environment. This calls for probabilistic methods that support multiple hypotheses for the robot's pose. We propose such a probabilistic method to predict the posterior distribution of camera poses given an observed image. Our proposed training strategy results in a generative model of camera poses given an image, which can be used to draw samples from the pose posterior distribution. Our method is streamlined and well-founded in theory, and it outperforms existing methods on localization in the presence of ambiguities.

Paper Structure (27 sections, 2 equations, 4 figures, 2 tables)

Introduction
Related Work
Visual relocalization
Uncertainty estimation
Method
Pose generative model
Training as a conditional variational autoencoder
Setting
Learning-by-reconstruction
Optimization objective terms
Intuition
Implementation Details
Network architecture
Training setup
Experiments
...and 12 more sections

Figures (4)

Figure 1:
Figure 2: (a) Our pose generative model is trained as the decoder in a conditional variational autoencoder pipeline reconstructing the ground-truth pose $y \in \mathrm{SE}(3)$ for an image $\boldsymbol{x} \in \mathbb{R}^{H \times W \times 3}$. The loss terms used in the learning objective are shown in orange. During training the latent posterior only partly overlaps with the latent prior, resulting in generated pose samples concentrated at the ground-truth pose. (b) At inference time latent samples are drawn from the prior distribution and mapped to distinct modes in $\mathrm{SE}(3)$. In the 3D rendering of the scene we can see that for the query image viewing an ambiguous landing at the staircase, output pose samples are concentrated at three modes looking at different, but visually similar landings, including the ground truth. Pose samples are shown by teal and the ground truth by orange camera frusta.
Figure 3:
Figure 4:
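
The inference step described in the Figure 2 caption (draw latent samples from the prior, then map each one to a pose for the query image) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the standard-normal prior, the latent and feature sizes, the translation-plus-unit-quaternion pose parametrization, and the names `make_toy_decoder` and `sample_poses` are all assumptions, and the "decoder" is a random linear map standing in for a trained network.

```python
import math
import random

LATENT_DIM = 2    # assumed latent size
FEATURE_DIM = 8   # assumed size of the image feature vector
POSE_DIM = 7      # assumed pose layout: 3 translation + 4 quaternion values


def make_toy_decoder(rng):
    """Hypothetical stand-in for a trained decoder: a fixed random linear map."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(POSE_DIM)]
               for _ in range(LATENT_DIM + FEATURE_DIM)]

    def decode(z, features):
        # Condition on the image by concatenating latent sample and features.
        inputs = list(z) + list(features)
        raw = [sum(v * weights[i][j] for i, v in enumerate(inputs))
               for j in range(POSE_DIM)]
        translation = raw[:3]
        # Normalize the last four values so they form a unit quaternion.
        norm = math.sqrt(sum(q * q for q in raw[3:]))
        quaternion = [q / norm for q in raw[3:]]
        return translation, quaternion

    return decode


def sample_poses(decode, features, num_samples, rng):
    """Draw latent samples from an assumed N(0, I) prior and decode each one."""
    poses = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        poses.append(decode(z, features))
    return poses


rng = random.Random(0)
decode = make_toy_decoder(rng)
# Placeholder for features that an image encoder would produce.
features = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
poses = sample_poses(decode, features, num_samples=100, rng=rng)
```

With a trained decoder in place of the toy one, the resulting set of pose samples would be expected to cluster around the distinct modes that the caption describes for ambiguous views.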
