FocalPose++: Focal Length and Object Pose Estimation via Render and Compare

Martin Cífka; Georgy Ponimatkin; Yann Labbé; Bryan Russell; Mathieu Aubry; Vladimir Petrik; Josef Sivic

TL;DR

This work derives a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to jointly estimate object pose and camera focal length, and investigates several loss functions for this joint estimation.

Abstract

We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are threefold. First, we derive a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task. Second, we investigate several different loss functions for jointly estimating the object pose and focal length. We find that combining direct focal length regression with a reprojection loss that disentangles the contributions of translation, rotation, and focal length leads to improved results. Third, we explore the effect of different synthetic training data on the performance of our method. Specifically, we investigate different distributions for sampling the object's 6D pose and the camera's focal length when rendering the synthetic images, and show that a parametric distribution fitted to real training data works best. We show results on three challenging benchmark datasets that depict known 3D models in uncontrolled settings. We demonstrate that our focal length and 6D pose estimates have lower error than existing state-of-the-art methods.
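
To make the iterative refinement idea concrete, here is a minimal Python sketch of a focal length refinement loop. It is not the authors' implementation. The multiplicative exponential form of the update, the function names, and the toy predictor that stands in for the trained network are all assumptions chosen for illustration; the abstract only states that a focal length update rule is derived.

```python
import math

def focal_update(f, v_f):
    """Assumed update form: f_next = exp(v_f) * f.

    A multiplicative update keeps the focal length strictly positive
    for any real-valued network output v_f.
    """
    return math.exp(v_f) * f

def refine_focal_length(f_init, predict_v_f, n_iters=5):
    """Iteratively refine the focal length.

    predict_v_f is a hypothetical stand-in for the network that compares
    a rendering at the current estimate with the input image.
    """
    f = f_init
    for _ in range(n_iters):
        v_f = predict_v_f(f)
        f = focal_update(f, v_f)
    return f

def make_toy_predictor(f_true, step=0.5):
    """Toy oracle (illustration only): moves part of the way toward a
    known focal length in log space. A real system would predict v_f
    from images, not from the ground truth."""
    return lambda f: step * math.log(f_true / f)

if __name__ == "__main__":
    f_est = refine_focal_length(600.0, make_toy_predictor(1200.0), n_iters=10)
    print(f"refined focal length: {f_est:.2f}")
```

With the toy predictor, each iteration halves the remaining error in log space, so the estimate approaches the target focal length geometrically. In the actual method, the 6D pose is updated in the same loop.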

Paper Structure (43 sections, 16 equations, 4 figures, 9 tables)

Introduction
Related Work
6D pose estimation of rigid objects from RGB images.
Camera calibration.
Joint 6D pose and focal length estimation from a single in-the-wild image.
Approach
Approach Overview
Discussion.
Update rules with focal length estimation
Focal length update.
6D pose update.
Pose and focal length training loss
Focal length loss.
6D pose loss.
Training data
...and 28 more sections

Figures (4)

Figure 1: Given a single input photograph (left) and a known 3D model, our approach accurately estimates the 6D camera-object pose together with the focal length of the camera (right), here shown by overlaying the aligned 3D model over the input image. Our approach handles a large range of focal lengths and the resulting perspective effects.
Figure 2: FocalPose overview. (a) Given a single in-the-wild RGB input image $I$ of a known object 3D model $\mathcal{M}$, parameters $\theta^k$ composed of focal length $f^k$ and the object 6D pose (3D translation $t^k$ and 3D rotation $R^k$) are iteratively updated using our render-and-compare approach. The rendering $R$, together with the input image $I$, is given to a deep neural network $F$ that predicts the update $\Delta \theta^k$, which is then converted into the updated parameters $\theta^{k+1}$ using a non-linear update rule $U$. (b) Illustration of the camera-object setup with parameters $\theta$ composed of 3D translation $t$, 3D rotation $R$ and focal length $f$. The alignment network is trained using a novel pose and focal length loss that disentangles the focal length and pose updates. The two main contributions of this work are highlighted by red boxes in the figure.
Figure 3: Parametric distribution of object poses and focal lengths in the training data. We plot the poses and focal lengths of the real training dataset of Pix3D-sofa class (blue) together with poses and focal lengths sampled from the parametric distribution fitted to the data (orange). The number of samples from our distribution is the same as the number of data points in the real training dataset. We plot the rotations, xy-translations, and z-translations with focal lengths separately. To visualize the rotations, we plot the unit x-vector multiplied by the sampled rotations.
Figure 4: Main failure modes are: (a) symmetric objects, (b) local minima, and (c) incorrect 3D models identified by the object detector.
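
The rotation visualization described in the Figure 3 caption (the unit x-vector multiplied by each sampled rotation) can be sketched as follows. This is an illustrative reading of the caption, not the authors' plotting code; the function name and the array layout are assumptions.

```python
import numpy as np

def rotated_x_axes(rotations):
    """Map each rotation matrix to the point R @ e_x on the unit sphere.

    rotations: array-like of shape (N, 3, 3) holding rotation matrices
    (layout is an assumption). Returns an (N, 3) array whose rows can be
    shown as a 3D scatter plot.
    """
    e_x = np.array([1.0, 0.0, 0.0])
    return np.asarray(rotations, dtype=float) @ e_x

if __name__ == "__main__":
    identity = np.eye(3)
    # Rotation by 90 degrees about the z-axis.
    rot_z_90 = np.array([[0.0, -1.0, 0.0],
                         [1.0,  0.0, 0.0],
                         [0.0,  0.0, 1.0]])
    print(rotated_x_axes([identity, rot_z_90]))
```

Because only the image of the x-axis is plotted, rotations that differ by a spin about that axis map to the same point, so the plot shows where the object's x-axis points and not the full 3D rotation.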
