Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, Victor Adrian Prisacariu

TL;DR

ACE0 introduces Scene Coordinate Reconstruction, a self-supervised, learning-based SfM framework that iteratively learns a scene-coordinate regression model from unposed RGB images and relocalizes additional views to refine the scene map. By exploiting a fast, neural mapping approach and a PnP+RANSAC relocalizer, it reconstructs thousands of images without pose priors and achieves pose accuracy approaching traditional feature-based SfM on diverse datasets. The method includes pose and calibration refinement, seed initialization strategies, and adaptive training stopping to ensure scalability, while enabling high-quality novel view synthesis using the learned scene representation. ACE0 demonstrates robust performance across indoor and outdoor scenes (7-Scenes, Mip-NeRF 360, Tanks and Temples) with competitive PSNR and efficient runtimes on a single GPU, offering a practical alternative to heavy NeRF-based or traditional SfM pipelines.

Abstract

We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. In many cases, our method, ACE0, estimates camera poses with an accuracy close to feature-based SfM, as demonstrated by novel view synthesis. Project page: https://nianticlabs.github.io/acezero/
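
The re-interpretation of incremental SfM as "iterated application and refinement of a visual relocalizer" can be illustrated with a small control-flow sketch. This is not the ACE0 implementation: `incremental_reconstruction`, `train_mapper`, and `relocalize` are hypothetical names, the stopping rule (stop when an iteration registers no additional images) is an assumption made for this example, and the toy "scene" is just integers on a line.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple


def incremental_reconstruction(
    images: Iterable[Hashable],
    seed: Hashable,
    seed_pose: object,
    train_mapper: Callable[[Dict[Hashable, object]], object],
    relocalize: Callable[[object, Hashable], Tuple[bool, object]],
    max_iterations: int = 50,
) -> Dict[Hashable, object]:
    """Alternate between mapping and relocalization (illustrative only)."""
    images = list(images)
    registered = {seed: seed_pose}  # start from one image with a fixed pose
    for _ in range(max_iterations):
        # "Mapping": fit a model to the currently registered images and poses.
        model = train_mapper(registered)
        # "Relocalization": re-estimate poses of all images, including ones
        # that were not registered in earlier iterations.
        updated = {}
        for image in images:
            ok, pose = relocalize(model, image)
            if ok:
                updated[image] = pose
        grew = len(updated) > len(registered)
        registered = updated or registered
        if not grew:  # assumed stopping rule: no additional views registered
            break
    return registered


# Toy stand-ins: the "model" is the set of registered positions, and an image
# registers if it lies within 2 units of any position the model already covers.
def toy_mapper(poses):
    return set(poses)


def toy_relocalizer(model, image):
    return any(abs(image - m) <= 2 for m in model), image


result = incremental_reconstruction(
    images=list(range(10)) + [20],  # image 20 is too far away to ever register
    seed=0,
    seed_pose=0,
    train_mapper=toy_mapper,
    relocalize=toy_relocalizer,
)
print(sorted(result))
```

In this toy run the registered set grows outward from the seed by two images per iteration, and the isolated image 20 is never registered, mirroring the idea that views only join the reconstruction once the relocalizer can register them.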

Paper Structure (34 sections, 5 equations, 13 figures, 22 tables)

Figures (13)

  • Figure 1: Reconstructing 10,000 Images. Top: Starting from a single image and the identity pose, we train a learning-based visual relocalizer. The relocalizer allows us to estimate the poses of more views, and the additional views allow us to refine the relocalizer. We show three out of six iterations for this scene (7-Scenes Office [shotton2013scene]). All 10k images have been posed in roughly 1 hour on a single GPU. In comparison, NoPe-NeRF [bian2023nope] needs two days to pose 200 images. The point cloud is a visualization of the implicit scene representation of the relocalizer. Camera positions are color coded by relocalization confidence from yellow (low) to green (high). Bottom: Point clouds from Nerfacto [nerfstudio] trained on top of our poses for a few scenes from our experiments.
  • Figure 2: ACE0 Framework. Top left: We loop between learning a reconstruction from the current set of images and poses ("neural mapping"), and estimating poses of more images ("relocalization"). Top right: During the mapping stage, we train a scene coordinate regression network as our scene representation. Camera poses of the last relocalization round and camera calibration parameters are refined during this process. We visualize scene coordinates by mapping XYZ to the RGB cube. Bottom: In the relocalization stage, we re-estimate poses of images using the scene coordinate regression network, including images that were previously not registered to the reconstruction. If the registration of an image succeeds, it will be used in the next iteration of the mapping stage; otherwise it will not.
  • Figure 3: Left: Pose Refinement. Since we register images based on a coarse and incomplete state of the reconstruction, we add the ability to refine poses during neural mapping. An MLP predicts pose updates relative to the initial poses, supervised by the reprojection error of scene coordinates. Right: Initialization. To start the reconstruction, we train the network using one image, the identity pose and a depth estimate, here ZoeDepth [bhat2023zoedepth]. In this example, we register 33 views to the initial reconstruction. Depth estimates are only used for this step.
  • Figure 4: Reconstructed Poses. We show poses estimated by ACE0 for a selection of scenes. We color code the reconstruction iteration in which a particular view has been registered. We show the ACE0 point cloud as a representation of the scene. The seed image is shown as a gray frustum. We also compare our poses to poses estimated by COLMAP (Mip-NeRF 360, Tanks and Temples) and KinectFusion (7-Scenes).
  • Figure 5: More Reconstructed Poses. We show poses estimated by ACE0. We color code the reconstruction iteration in which a particular view has been registered. We show the ACE0 point cloud as a representation of the scene. We also compare our poses to poses estimated by COLMAP (Mip-NeRF 360, Tanks and Temples) and KinectFusion (7-Scenes).
  • ...and 8 more figures
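
The Figure 3 caption mentions supervision by "the reprojection error of scene coordinates". As a rough illustration of that quantity, the sketch below projects a predicted 3D scene coordinate into an image with a pinhole camera and measures the pixel distance to the location it was predicted for. The function name, the world-to-camera pose convention, the single focal length, and the handling of points behind the camera are assumptions for this example and are not taken from the paper.

```python
import math
from typing import Sequence


def reprojection_error(
    scene_point: Sequence[float],          # predicted 3D scene coordinate (world frame)
    pixel: Sequence[float],                # 2D pixel the prediction belongs to
    rotation: Sequence[Sequence[float]],   # 3x3 world-to-camera rotation (assumed convention)
    translation: Sequence[float],          # world-to-camera translation
    focal_length: float,                   # single focal length in pixels (assumption)
    principal_point: Sequence[float],      # (cx, cy) in pixels
) -> float:
    """Pixel distance between the projected scene point and its source pixel."""
    # Transform the scene coordinate from the world frame into the camera frame.
    cam = [
        sum(rotation[r][c] * scene_point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if cam[2] <= 0:
        # Point is behind the camera; this sketch simply reports an infinite error.
        return math.inf
    # Pinhole projection onto the image plane.
    u = focal_length * cam[0] / cam[2] + principal_point[0]
    v = focal_length * cam[1] / cam[2] + principal_point[1]
    return math.hypot(u - pixel[0], v - pixel[1])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The point (0.5, -0.25, 2.0) projects to pixel (445.0, 177.5) with these intrinsics.
err = reprojection_error(
    scene_point=(0.5, -0.25, 2.0),
    pixel=(448.0, 181.5),
    rotation=identity,
    translation=(0.0, 0.0, 0.0),
    focal_length=500.0,
    principal_point=(320.0, 240.0),
)
print(err)  # pixel offset of (3, 4) gives a distance of 5.0
```

A refinement step of the kind the caption describes would adjust the pose (and possibly the calibration) to reduce this error over many pixels; the sketch only evaluates the error for a single correspondence.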