FocalPose++: Focal Length and Object Pose Estimation via Render and Compare

Martin Cífka, Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Vladimir Petrik, Josef Sivic

TL;DR

This work derives a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task, and investigates several different loss functions for jointly estimating the object pose and focal length.

Abstract

We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are threefold. First, we derive a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task. Second, we investigate several different loss functions for jointly estimating the object pose and focal length. We find that combining direct focal length regression with a reprojection loss that disentangles the contributions of translation, rotation, and focal length leads to improved results. Third, we explore the effect of different synthetic training data on the performance of our method. Specifically, we investigate different distributions for sampling the object's 6D pose and the camera's focal length when rendering the synthetic images, and show that a parametric distribution fitted to real training data works best. We show results on three challenging benchmark datasets that depict known 3D models in uncontrolled settings. We demonstrate that our focal length and 6D pose estimates have lower error than existing state-of-the-art methods.

Paper Structure (43 sections, 16 equations, 4 figures, 9 tables)

This paper contains 43 sections, 16 equations, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: Given a single input photograph (left) and a known 3D model, our approach accurately estimates the 6D camera-object pose together with the focal length of the camera (right), here shown by overlaying the aligned 3D model over the input image. Our approach handles a large range of focal lengths and the resulting perspective effects.
  • Figure 2: FocalPose overview. (a) Given a single in-the-wild RGB input image $I$ of an object with a known 3D model $\mathcal{M}$, parameters $\theta^k$ composed of focal length $f^k$ and the object 6D pose (3D translation $t^k$ and 3D rotation $R^k$) are iteratively updated using our render-and-compare approach. The rendering $R$, together with the input image $I$, is given to a deep neural network $F$ that predicts the update $\Delta \theta^k$, which is then converted into the updated parameters $\theta^{k+1}$ using a non-linear update rule $U$. (b) Illustration of the camera-object setup with parameters $\theta$ composed of 3D translation $t$, 3D rotation $R$ and focal length $f$. The alignment network is trained using a novel pose and focal length loss that disentangles the focal length and pose updates. The two main contributions of this work are highlighted by red boxes in the figure.
  • Figure 3: Parametric distribution of object poses and focal lengths in the training data. We plot the poses and focal lengths of the real training dataset of Pix3D-sofa class (blue) together with poses and focal lengths sampled from the parametric distribution fitted to the data (orange). The number of samples from our distribution is the same as the number of data points in the real training dataset. We plot the rotations, xy-translations, and z-translations with focal lengths separately. To visualize the rotations, we plot the unit x-vector multiplied by the sampled rotations.
  • Figure 4: Main failure modes are: (a) symmetric objects, (b) local minima, and (c) incorrect 3D models identified by the object detector.
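
The iterative loop described in the Figure 2 caption (render, predict an update $\Delta \theta^k$, apply the update rule $U$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`apply_update`, `refine`, `toy_render`, `toy_network`) are hypothetical, the renderer and network are toy stand-ins, and the specific form of `apply_update` (a multiplicative focal length and depth update, with an image-plane shift of the object centre) is an assumption modelled on updates commonly used in render-and-compare pose refiners. The paper's exact rule may differ.

```python
import numpy as np


def apply_update(theta, delta):
    """Hypothetical update rule U (an assumption, not the paper's exact formula)."""
    f, R, t = theta["f"], theta["R"], theta["t"]
    # Multiplicative update in log-space keeps the focal length positive.
    f_new = np.exp(delta["v_f"]) * f
    # Multiplicative depth update keeps the object in front of the camera.
    z_new = delta["v_z"] * t[2]
    # Shift the projected object centre by v_xy (pixels), then back-project
    # with the updated focal length and depth.
    xy_new = (delta["v_xy"] / f_new + t[:2] / t[2]) * z_new
    t_new = np.array([xy_new[0], xy_new[1], z_new])
    # Compose the predicted rotation increment with the current rotation.
    return {"f": f_new, "R": delta["dR"] @ R, "t": t_new}


def refine(image, model, theta0, render, network, n_iters=3):
    """Render-and-compare loop: render, predict an update, apply U, repeat."""
    theta = theta0
    for _ in range(n_iters):
        rendering = render(model, theta)
        delta = network(image, rendering)
        theta = apply_update(theta, delta)
    return theta


# Toy stand-ins so the sketch runs without a renderer or a trained network.
def toy_render(model, theta):
    # A real renderer would rasterise the 3D model; here the "rendering"
    # simply carries the current parameters.
    return theta


def toy_network(image, rendering):
    # A real network compares pixels; this one reads the target directly
    # and moves halfway towards it in log-space at every step.
    return {
        "v_f": 0.5 * np.log(image["f"] / rendering["f"]),
        "v_z": np.sqrt(image["z"] / rendering["t"][2]),
        "v_xy": np.zeros(2),
        "dR": np.eye(3),
    }


target = {"f": 1200.0, "z": 2.0}  # toy "image": the values to recover
theta0 = {"f": 600.0, "R": np.eye(3), "t": np.array([0.1, 0.0, 1.0])}
theta_final = refine(target, None, theta0, toy_render, toy_network, n_iters=10)
```

In this toy setting the focal length and depth estimates in `theta_final` approach the target values as the number of iterations grows, because each step halves the remaining error in log-space. With a zero image-plane shift, the ratio of the x-translation to the depth stays fixed, so the projected object centre does not move.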