AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
Hengyi Wang, Lourdes Agapito
TL;DR
AMB3R is a feed-forward pipeline for metric-scale 3D reconstruction built around a compact sparse-voxel backend fused through a transformer. A frozen VGGT front-end provides multi-view geometry, and a metric-scale head recovers scale, yielding accurate depth, pose, and 3D reconstruction without test-time optimization. The same backend extends to uncalibrated visual odometry and large-scale structure from motion via rating-based clustering and memory-informed mapping. The model achieves state-of-the-art results across 7 tasks and 13 datasets with modest training resources (~80 GPU hours). The work shows that a unified 3D perception system can exploit spatial compactness to perform reconstruction, VO/SLAM, and SfM without task-specific fine-tuning or heavy optimization. Code, weights, and evaluation tools are released as open source.
Abstract
We present AMB3R, a multi-view feed-forward model for dense, metric-scale 3D reconstruction that addresses diverse 3D vision tasks. The key idea is to use a sparse yet compact volumetric scene representation as the backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R extends seamlessly to online uncalibrated visual odometry and to large-scale structure from motion, without task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation as well as in 3D reconstruction, and it even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
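To make "sparse yet compact" concrete, the sketch below pools per-point features from a pointmap into only the voxels that are actually occupied, so points from different views that land in the same cell share one entry. It is an illustration under stated assumptions, not the authors' implementation: the function name `voxelize_sparse`, the mean pooling, and the `voxel_size` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def voxelize_sparse(points, features, voxel_size):
    """Pool per-point features into occupied voxels only (hypothetical helper).

    points:   (N, 3) array of 3D positions, e.g. a flattened pointmap.
    features: (N, C) array of per-point features.
    Returns the occupied voxel coordinates, one pooled feature per voxel,
    and the point-to-voxel index needed to scatter results back to points.
    """
    # Integer grid coordinates of the voxel containing each point.
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Keep only occupied voxels; `inverse` maps each point to its voxel row.
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions
    # Mean pooling (an assumption): average features that share a voxel.
    sums = np.zeros((len(voxels), features.shape[1]))
    np.add.at(sums, inverse, features)
    counts = np.bincount(inverse, minlength=len(voxels))
    return voxels, sums / counts[:, None], inverse
```

Empty space costs nothing in this layout, and the returned `inverse` index lets voxel-level outputs be gathered back onto the original points with `pooled[inverse]`.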
