AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend

Hengyi Wang, Lourdes Agapito

TL;DR

AMB3R introduces a feed-forward pipeline for metric-scale 3D reconstruction built around a compact sparse-voxel backend that is processed by a transformer. A frozen VGGT front-end provides multi-view geometry, while a metric-scale head recovers scale, enabling accurate depth, pose, and 3D reconstruction without test-time optimization. The backend supports uncalibrated visual odometry and large-scale structure-from-motion via rating-based clustering and memory-informed mapping, achieving state-of-the-art results across 7 tasks and 13 datasets with modest training resources (~80 GPU hours). The work shows that a single model exploiting spatial compactness can handle reconstruction, VO/SLAM, and SfM without task-specific fine-tuning or heavy optimization. The open-source release provides code, weights, and evaluation tools.

Abstract

We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction at metric scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation as well as 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
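
To make the idea of a sparse, compact volumetric representation concrete, the following is a minimal NumPy sketch of fusing per-pixel 3D points and features into occupied voxels only. It is an illustration under stated assumptions, not the authors' implementation: the function name `fuse_points_to_sparse_voxels`, the `voxel_size` parameter, and mean pooling of features per voxel are all hypothetical choices, and the lexicographic voxel order returned by `np.unique` merely stands in for whatever ordering the model actually uses.

```python
import numpy as np

def fuse_points_to_sparse_voxels(points, feats, voxel_size):
    """Pool per-point features into the set of occupied voxels.

    points: (N, 3) float array of 3D points (e.g. a flattened pointmap).
    feats:  (N, C) float array of per-point features.
    Returns (coords, voxel_feats, inverse):
      coords      (M, 3) integer indices of occupied voxels, lexicographically sorted
      voxel_feats (M, C) mean feature of the points falling in each voxel (assumed pooling)
      inverse     (N,)   index of the voxel each input point belongs to
    """
    # Quantize each point to an integer voxel index.
    grid = np.floor(points / voxel_size).astype(np.int64)
    # Keep only occupied voxels; empty space is never stored.
    coords, inverse = np.unique(grid, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # normalize shape across NumPy versions
    # Accumulate features per voxel, then divide by the point count.
    sums = np.zeros((len(coords), feats.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, feats)
    counts = np.bincount(inverse, minlength=len(coords))
    return coords, sums / counts[:, None], inverse


if __name__ == "__main__":
    pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.1, 0.1]])
    f = np.array([[1.0], [3.0], [10.0]])
    coords, vf, inv = fuse_points_to_sparse_voxels(pts, f, voxel_size=1.0)
    print(coords, vf, inv)
```

Because overlapping views place many points in the same voxel, the number of occupied voxels M is typically much smaller than the number of points N, which is what makes the representation compact.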

Paper Structure

This paper contains 36 sections, 14 equations, 13 figures, 17 tables.

Figures (13)

  • Figure 1: Overview. We present AMB3R, a feed-forward model for metric-scale 3D reconstruction. AMB3R supports camera pose estimation, monocular/multi-view metric depth/3D reconstruction, and can be seamlessly extended to visual odometry (VO)/SLAM and Structure from Motion (SfM) with no task-specific fine-tuning or test-time optimization. We use in-the-wild images for a)-e), scenes from Co-SLAM [wang2023co], TTT3R [chen2025ttt3r], and KITTI [geiger2012kitti] for f), scenes from COLMAP [schonberger2016colmap], Tanks&Temples [Knapitsch2017tankandtemple], and IMC PhotoTourism [jin2021imc] (all images) for g). No confidence threshold is used. f) & g) results are randomly down-sampled to 3 million points for visualization.
  • Figure 2: Overview of AMB3R. AMB3R consists of a front-end that predicts pointmaps and geometric features, and a back-end that fuses them into sparse voxels, which are serialized into a 1D sequence, processed by a transformer, and unserialized back to 3D. Per-pixel features are obtained via KNN interpolation and injected into the frozen front-end via zero-convolution for final prediction.
  • Figure 3: Qualitative showcase of generalization to in-the-wild images such as Longmen Grottoes [zheng2025culture3d].
  • Figure 4: Training cost comparison. We roughly estimate the training cost of each model. $\ddagger$ indicates concurrent works.
  • Figure 5: Overview of AMB3R (VO). Input frames are mapped with the keyframes in the active keyframe memory to predict geometry and camera poses. After coordinate alignment, we select new keyframes and update the global keyframe memory; poses and geometry for non-keyframes are also stored. If the active keyframe memory is not full, the new keyframe is appended; otherwise, we refresh the active keyframe memory by resampling a new set of keyframes from the global keyframe memory.
  • ...and 8 more figures
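
The keyframe-memory update described in the Figure 5 caption can be sketched roughly as follows. This is a hedged illustration only: the class `KeyframeMemory`, its `capacity` field, and the refresh policy (uniform random resampling from the global memory while always retaining the newest keyframe) are assumptions, since the caption does not say how the resampling is done. Keyframe selection, coordinate alignment, and pose/geometry storage are omitted.

```python
import random
from dataclasses import dataclass, field

@dataclass
class KeyframeMemory:
    """Toy model of the active/global keyframe memories (hypothetical names)."""
    capacity: int                                   # max size of the active memory, must be >= 1
    active: list = field(default_factory=list)      # keyframes used for mapping incoming frames
    global_memory: list = field(default_factory=list)  # every keyframe seen so far
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def add_keyframe(self, frame_id):
        # Every new keyframe is recorded in the global memory.
        self.global_memory.append(frame_id)
        if len(self.active) < self.capacity:
            # Active memory not full: append the new keyframe.
            self.active.append(frame_id)
        else:
            # Active memory full: refresh it by resampling from the global memory.
            # Assumption: sample uniformly among older keyframes and keep the newest one.
            older = self.global_memory[:-1]
            sampled = self.rng.sample(older, self.capacity - 1)
            self.active = sorted(sampled) + [frame_id]


if __name__ == "__main__":
    mem = KeyframeMemory(capacity=3)
    for fid in range(6):
        mem.add_keyframe(fid)
        print(fid, mem.active)
```

The point of the two-level design is that the active memory stays at a fixed size, so the cost of mapping each incoming frame is bounded, while the global memory retains all keyframes for later resampling.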