Wide-Baseline Relative Camera Pose Estimation with Directional Learning

Kefan Chen, Noah Snavely, Ameesh Makadia

TL;DR

This work tackles wide-baseline relative camera pose estimation by predicting discrete distributions over a factorized pose space rather than regressing the pose directly. DirectionNet decomposes the relative pose into four directional components on the sphere $S^2$, estimates a distribution for each, and derives the pose via spherical expectation, employing a two-stage strategy with derotation to simplify translation estimation. With an encoder-decoder architecture and loss terms that jointly supervise dense distributions and direction vectors, DirectionNet achieves a nearly 50% reduction in error over direct regression on challenging synthetic and real datasets (InteriorNet and Matterport3D). The approach demonstrates robustness to occlusions and perspective changes, outperforming various baselines including parametric probabilistic models and feature-based methods in wide-baseline regimes, and shows promising generalization to outdoor scenes like KITTI. Overall, discrete, directionally parameterized pose modeling offers a practical, scalable path for reliable pose estimation in demanding visual-geometric tasks.

Abstract

Modern deep learning techniques that regress the relative camera pose between two images have difficulty dealing with challenging scenarios, such as large camera motions resulting in occlusions and significant changes in perspective that leave little overlap between images. These models continue to struggle even with the benefit of large supervised training datasets. To address the limitations of these models, we take inspiration from techniques that show regressing keypoint locations in 2D and 3D can be improved by estimating a discrete distribution over keypoint locations. Analogously, in this paper we explore improving camera pose regression by instead predicting a discrete distribution over camera poses. To realize this idea, we introduce DirectionNet, which estimates discrete distributions over the 5D relative pose space using a novel parameterization to make the estimation problem tractable. Specifically, DirectionNet factorizes relative camera pose, specified by a 3D rotation and a translation direction, into a set of 3D direction vectors. Since 3D directions can be identified with points on the sphere, DirectionNet estimates discrete distributions on the sphere as its output. We evaluate our model on challenging synthetic and real pose estimation datasets constructed from Matterport3D and InteriorNet. Promising results show a near 50% reduction in error over direct regression methods.
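
The step from a discrete distribution on the sphere to a single 3D direction can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `spherical_expectation`, the grid layout (rows as colatitude, columns as longitude, sampled at cell centres), and the $\sin\theta$ area weighting are all assumptions made for illustration.

```python
import numpy as np

def spherical_expectation(prob):
    """Expected unit direction of a discrete distribution on the sphere.

    prob: (H, W) array of non-negative weights on an equirectangular grid.
    Assumed layout (not taken from the paper): rows index colatitude in
    (0, pi), columns index longitude in (0, 2*pi), sampled at cell centres.
    """
    h, w = prob.shape
    theta = (np.arange(h) + 0.5) * np.pi / h        # colatitude per row
    phi = (np.arange(w) + 0.5) * 2.0 * np.pi / w    # longitude per column
    theta, phi = np.meshgrid(theta, phi, indexing="ij")

    # Unit direction vector associated with every grid cell, shape (H, W, 3).
    dirs = np.stack(
        [np.sin(theta) * np.cos(phi),
         np.sin(theta) * np.sin(phi),
         np.cos(theta)],
        axis=-1,
    )

    # Cells near the poles cover less area on the sphere, so each weight is
    # scaled by sin(theta) before normalising (an assumption of this sketch).
    weights = prob * np.sin(theta)
    weights = weights / weights.sum()

    # Weighted mean of the cell directions, projected back onto the sphere.
    # A near-uniform distribution gives a mean close to zero, in which case
    # this normalisation is ill-conditioned.
    mean = (weights[..., None] * dirs).sum(axis=(0, 1))
    return mean / np.linalg.norm(mean)
```

A distribution concentrated on a single cell returns that cell's direction, while a broader distribution returns a direction between the grid samples, which is what lets a discrete output yield a continuous estimate.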

Paper Structure

This paper contains 33 sections, 8 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Relative pose estimation. At the core of our method is the DirectionNet, which maps a source image $I_0$ and a target image $I_1$ to a number of directional probability distributions over the 2-sphere, shown here as color-coded spheres. We convert the distributions to vectors by finding their expected values. The rotation matrix $R$ is approximated by orthogonal Procrustes from three estimated unit vectors $(\hat{v}_{x}, \hat{v}_{y}, \hat{v}_{z})$. As an alternative, DirectionNet-R could generate two directional vectors and $R$ could be determined by Gram-Schmidt orthogonalization. To facilitate estimating the translation $\hat{v}_{t}$, we derotate the input images by applying the homography introduced in Sec. \ref{eq:Homography}, yielding the transformed input images $H_{r}^{T}(I_0)$ and $H_r(I_1)$, where $r$ is the half-rotation of the estimated camera rotation $R$.
  • Figure 2: (a) The image encoder generates embeddings from a pair of input images, and the spherical decoder transforms and upsamples these embeddings to produce probability distributions over $S^2$, which are represented with equirectangular maps. (b) Spherical padding ensures the boundary pixels reflect the correct neighbors on the sphere. In this example, a 4$\times$4 grid is padded to size 6$\times$6. The corresponding labeled squares illustrate the padding process. See Appendix \ref{apppadding} for further clarification.
  • Figure 3: (a) True rotation magnitude ($^\circ$) vs. error ($^\circ$). The scatter plot shows that our model is robust in the presence of large relative rotations. (b) Median error ($^\circ$) vs. overlap (%). As image overlap decreases from 90% to 20%, the median test error of our method increases much more slowly than that of SIFT+LMedS. When overlap is very high, local feature based techniques are still superior.
  • Figure 4: Qualitative evaluation on Matterport-B. Any point in one image plane corresponds to a ray shooting from the optical center, which could be projected to the other image plane as the epipolar line. (a) We draw a number of points detected by SIFT in different colors on each target image $I_1$, and (b) show their corresponding epipolar lines on the source image using the ground truth pose, (c) visualizations from our DirectionNet-9D, (d) Bin & Delta, (e) spherical regression, (f) 6D regression, (g) SIFT+LMedS. Most examples demonstrate some of the most difficult scenarios, such as drastic change in viewpoint and significant occlusion. The last two rows show that SIFT+LMedS can outperform the others in the case of smaller motions for which the feature-based approach can find reliable feature correspondences.
  • Figure 5: Discrete distributions on the sphere (left) are represented internally as equirectangular grids (right). Although pixels A and B are adjacent on the sphere, as are C and D, they are not adjacent in the grid. Our spherical padding (shown in Fig. 2b, page 4) corrects for this.
  • ...and 10 more figures
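
The Figure 1 caption mentions two ways of turning estimated direction vectors into a rotation matrix: orthogonal Procrustes from three vectors, and Gram-Schmidt from two. The NumPy sketch below illustrates both under stated assumptions: the function names are hypothetical, and treating the estimated vectors as the columns of $R$ is a convention chosen here, not one confirmed by the captions above.

```python
import numpy as np

def rotation_from_three_directions(vx, vy, vz):
    """Nearest rotation (Frobenius norm) to the matrix with columns vx, vy, vz.

    Solves the orthogonal Procrustes problem with an SVD. Stacking the
    vectors as columns is an assumption of this sketch.
    """
    m = np.stack([vx, vy, vz], axis=1)
    u, _, vt = np.linalg.svd(m)
    # Flip the last singular direction if needed so that det(R) = +1,
    # i.e. the result is a proper rotation rather than a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

def rotation_from_two_directions(vx, vy):
    """Rotation built from two directions by Gram-Schmidt orthogonalization."""
    x = vx / np.linalg.norm(vx)
    y = vy - np.dot(x, vy) * x      # remove the component of vy along x
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)              # third axis completes a right-handed frame
    return np.stack([x, y, z], axis=1)
```

The Procrustes variant spreads the correction over all three estimates, whereas Gram-Schmidt trusts the first vector fully and adjusts only the second, so the two can give slightly different rotations when the inputs are noisy.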