Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction

Shahram Najam Syed, Yitian Hu, Yuchao Yao

TL;DR

This work tackles large-scale monocular 3D reconstruction from uncalibrated video by addressing scale drift, pose drift, and the limited capacity of a single global NeRF. It proposes a unified pipeline that jointly learns metric depth (via a ViT-based estimator with metric anchors), camera poses refined by multi-scale feature-based bundle adjustment to suppress drift, and an incremental hierarchy of local hash-grid NeRFs that covers city-block-scale scenes within the memory of a single GPU. Progressive training alternates depth warm-up, pose refinement, and radiance fine-tuning, yielding metric-scale consistency and high-fidelity novel-view synthesis without external calibration. On Tanks and Temples, the method reports an Absolute Trajectory Error (ATE) of 0.001-0.021 m, up to 18x lower than BARF and 2x lower than NoPe-NeRF, demonstrating practical, calibration-free 3D capture from a handheld RGB camera with efficient memory usage.

Abstract

Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space--leveraging learned pyramidal descriptors instead of brittle keypoints--to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences--up to 18x lower than BARF and 2x lower than NoPe-NeRF--while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
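
For readers unfamiliar with the headline metric, Absolute Trajectory Error is commonly computed as the root-mean-square distance between estimated and ground-truth camera positions. The sketch below is a minimal illustration of that common definition, not the authors' evaluation code: it assumes the two trajectories are already aligned in a shared metric frame (the alignment protocol is not stated in this summary), and the function name and sample positions are hypothetical.

```python
import math


def absolute_trajectory_error(estimated, ground_truth):
    """Return the RMSE of per-frame position error, in the units of the inputs.

    Assumption: both trajectories are already expressed in the same
    coordinate frame and contain one (x, y, z) position per frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical camera centres in metres, for illustration only.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.003, 0.0), (2.0, 0.0, 0.004)]

ate = absolute_trajectory_error(estimated, ground_truth)
print(f"ATE = {ate:.4f} m")
```

With per-frame errors of 0 mm, 3 mm and 4 mm, the result falls in the millimetre range, the same order of magnitude as the lower end of the 0.001-0.021 m figures quoted in the abstract.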

Paper Structure

This paper contains 24 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of our joint depth–pose–radiance pipeline. A ViT-based depth module predicts metric depth (left), a coarse-to-fine feature-metric bundle-adjustment (center) refines SE(3) poses with learned confidences, and an incremental hierarchy of local hash-grid NeRFs (right) renders colour/depth while freezing completed fields for memory efficiency.
  • Figure 2: Qualitative comparison. Our method (right column of each sub-figure) avoids scale drift and preserves fine detail, while BARF produces ghost geometry and NoPe-NeRF blurs edges.
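
The allocation rule behind the incremental hierarchy in Figure 1 (start a new local field when view overlap falls below a threshold, and freeze completed fields) can be sketched as simple bookkeeping. This is a rough illustration under stated assumptions, not the paper's implementation: the class names, the overlap score in [0, 1], and the threshold value of 0.3 are all hypothetical, and the sketch assumes the previously active field is the one frozen when a new field is allocated.

```python
from dataclasses import dataclass, field


@dataclass
class LocalField:
    """Stand-in for one local hash-grid NeRF (hypothetical structure)."""
    field_id: int
    frames: list = field(default_factory=list)
    frozen: bool = False


class LocalFieldHierarchy:
    """Tracks which local field each incoming frame is assigned to."""

    def __init__(self, overlap_threshold):
        self.overlap_threshold = overlap_threshold
        self.fields = [LocalField(field_id=0)]

    @property
    def active(self):
        # Only the most recently allocated field is still being trained.
        return self.fields[-1]

    def add_frame(self, frame_index, overlap):
        """Assign a frame, allocating a new field if overlap is too low.

        `overlap` is an assumed score in [0, 1] describing how much the new
        view overlaps the active field; how it is measured is not specified
        in this summary.
        """
        if self.active.frames and overlap < self.overlap_threshold:
            self.active.frozen = True  # completed field is no longer updated
            self.fields.append(LocalField(field_id=len(self.fields)))
        self.active.frames.append(frame_index)
        return self.active.field_id


# Hypothetical overlap scores for six consecutive frames.
hierarchy = LocalFieldHierarchy(overlap_threshold=0.3)
overlaps = [1.0, 0.9, 0.6, 0.2, 0.8, 0.1]
assignments = [hierarchy.add_frame(i, o) for i, o in enumerate(overlaps)]
print(assignments)
print([f.frozen for f in hierarchy.fields])
```

Because frozen fields receive no further updates, only the active field would need optimizer state in memory at any time, which is consistent with the single-GPU coverage claimed in the abstract.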