Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, Victor Adrian Prisacariu
TL;DR
ACE0 introduces Scene Coordinate Reconstruction, a self-supervised, learning-based SfM framework that iteratively trains a scene coordinate regression model from unposed RGB images and relocalizes additional views to extend and refine the scene map. By combining a fast neural mapping approach with a PnP+RANSAC relocalizer, it reconstructs thousands of images without pose priors and achieves pose accuracy approaching that of traditional feature-based SfM on diverse datasets. The method includes pose and calibration refinement, seed initialization strategies, and adaptive early stopping of training to ensure scalability, and its estimated poses support high-quality novel view synthesis. ACE0 demonstrates robust performance across indoor and outdoor scenes (7-Scenes, Mip-NeRF 360, Tanks and Temples) with competitive PSNR and efficient runtimes on a single GPU, offering a practical alternative to heavy NeRF-based or traditional SfM pipelines.
Abstract
We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. In many cases, our method, ACE0, estimates camera poses with an accuracy close to feature-based SfM, as demonstrated by novel view synthesis. Project page: https://nianticlabs.github.io/acezero/
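The reinterpretation of incremental SfM as repeated training and application of a relocalizer can be illustrated with a minimal Python sketch. This is a simplification under stated assumptions, not the authors' implementation: the names `incremental_reconstruction`, `train_mapper`, `relocalize`, and `min_inliers` are hypothetical, and the toy stand-ins below replace the scene coordinate regression network and the PnP+RANSAC pose solver with trivial 1D logic so that the control flow can be run on its own.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Pose = float  # toy placeholder; a real pose would be a 6-DoF rigid transform


def incremental_reconstruction(
    images: Iterable[int],
    seed: int,
    seed_pose: Pose,
    train_mapper: Callable[[Dict[int, Pose]], object],
    relocalize: Callable[[object, int], Tuple[Optional[Pose], int]],
    min_inliers: int = 10,   # hypothetical acceptance threshold
    max_iters: int = 100,
) -> Dict[int, Pose]:
    """Alternate between (re)training a map and registering new views."""
    images = list(images)
    registered: Dict[int, Pose] = {seed: seed_pose}
    for _ in range(max_iters):
        # 1) Train the relocalizer on all views registered so far.
        model = train_mapper(registered)
        # 2) Try to register every remaining view against the current map.
        newly: Dict[int, Pose] = {}
        for img in images:
            if img in registered:
                continue
            pose, inliers = relocalize(model, img)
            if pose is not None and inliers >= min_inliers:
                newly[img] = pose
        # 3) Stop once no further view can be registered.
        if not newly:
            break
        registered.update(newly)
    return registered


# Toy stand-ins (assumptions for illustration only): images are integer
# positions on a line, and a view "overlaps" the map if it lies next to
# an already registered position.
def toy_train_mapper(registered: Dict[int, Pose]) -> set:
    return set(registered)


def toy_relocalize(model: set, img: int) -> Tuple[Optional[Pose], int]:
    if any(abs(img - m) == 1 for m in model):
        return float(img), 100  # pose estimate and inlier count
    return None, 0


poses = incremental_reconstruction(
    images=[0, 1, 2, 3, 10, 11],
    seed=0,
    seed_pose=0.0,
    train_mapper=toy_train_mapper,
    relocalize=toy_relocalize,
)
print(sorted(poses))
```

In this toy run the views 0 to 3 are registered one per iteration, while 10 and 11 never overlap the growing map and stay unregistered. The sketch leaves out the pose and calibration refinement, seed selection, and adaptive stopping mentioned in the summary.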
