Wide-Baseline Relative Camera Pose Estimation with Directional Learning
Kefan Chen, Noah Snavely, Ameesh Makadia
TL;DR
This work tackles wide-baseline relative camera pose estimation by predicting discrete distributions over a factorized pose space instead of regressing the pose directly. DirectionNet decomposes the relative pose into four directional components on the sphere $S^2$, estimates a distribution for each, and recovers the pose via spherical expectation. It employs a two-stage strategy in which derotation simplifies translation estimation. With an encoder-decoder architecture and loss terms that jointly supervise the dense distributions and the resulting direction vectors, DirectionNet achieves a nearly 50% reduction in error over direct regression on challenging synthetic and real datasets (InteriorNet and Matterport3D). The approach is robust to occlusions and perspective changes, outperforms baselines including parametric probabilistic models and feature-based methods in wide-baseline regimes, and shows promising generalization to outdoor scenes such as KITTI. Overall, discrete, directionally parameterized pose modeling offers a practical path to reliable pose estimation in wide-baseline settings.
Abstract
Modern deep learning techniques that regress the relative camera pose between two images have difficulty dealing with challenging scenarios, such as large camera motions resulting in occlusions and significant changes in perspective that leave little overlap between images. These models continue to struggle even with the benefit of large supervised training datasets. To address the limitations of these models, we take inspiration from techniques that show regressing keypoint locations in 2D and 3D can be improved by estimating a discrete distribution over keypoint locations. Analogously, in this paper we explore improving camera pose regression by instead predicting a discrete distribution over camera poses. To realize this idea, we introduce DirectionNet, which estimates discrete distributions over the 5D relative pose space using a novel parameterization to make the estimation problem tractable. Specifically, DirectionNet factorizes relative camera pose, specified by a 3D rotation and a translation direction, into a set of 3D direction vectors. Since 3D directions can be identified with points on the sphere, DirectionNet estimates discrete distributions on the sphere as its output. We evaluate our model on challenging synthetic and real pose estimation datasets constructed from Matterport3D and InteriorNet. Promising results show a near 50% reduction in error over direct regression methods.
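To make the idea of reading a 3D direction off a discrete distribution on the sphere concrete, the following is a minimal sketch of a spherical expectation in plain Python. It is an illustration under stated assumptions, not the DirectionNet implementation: the function name `spherical_expectation`, the equiangular grid layout (rows as colatitude, columns as longitude, sampled at cell centers), and the treatment of each grid value as a probability mass per cell are all assumptions made for this example.

```python
import math


def spherical_expectation(probs):
    """Return the unit direction given by the expectation of a discrete
    distribution on the sphere.

    Assumptions (hypothetical, for illustration only):
      * `probs` is an H x W nested list of non-negative weights.
      * Row i maps to colatitude theta = pi * (i + 0.5) / H.
      * Column j maps to longitude phi = 2 * pi * (j + 0.5) / W.
      * Each entry is a probability mass for its cell; the weights are
        normalized here so they sum to 1.
    """
    h = len(probs)
    w = len(probs[0])
    total = sum(sum(row) for row in probs)
    if total <= 0.0:
        raise ValueError("distribution has no mass")

    # Accumulate the probability-weighted mean of the cell-center unit vectors.
    mean = [0.0, 0.0, 0.0]
    for i, row in enumerate(probs):
        theta = math.pi * (i + 0.5) / h
        for j, p in enumerate(row):
            phi = 2.0 * math.pi * (j + 0.5) / w
            weight = p / total
            mean[0] += weight * math.sin(theta) * math.cos(phi)
            mean[1] += weight * math.sin(theta) * math.sin(phi)
            mean[2] += weight * math.cos(theta)

    # The mean generally lies inside the unit ball; project it back to S^2.
    norm = math.sqrt(sum(c * c for c in mean))
    if norm < 1e-12:
        raise ValueError("expectation is near zero; direction is undefined")
    return [c / norm for c in mean]


# Usage: a 4 x 8 grid with all mass split between two neighboring cells.
example_grid = [[0.0] * 8 for _ in range(4)]
example_grid[1][0] = 0.5
example_grid[1][1] = 0.5
example_direction = spherical_expectation(example_grid)
```

Because the expectation is a weighted sum, it is differentiable with respect to the predicted weights, which is what lets a loss supervise both the dense distribution and the direction vector derived from it. The final normalization is needed because the mean of unit vectors has length at most 1; a near-zero mean (for example, a perfectly uniform distribution) carries no directional information, so the sketch raises an error in that case.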
