Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction
Shahram Najam Syed, Yitian Hu, Yuchao Yao
TL;DR
This work tackles large-scale monocular 3D reconstruction from uncalibrated video by addressing scale drift, pose drift, and the limitations of a single global NeRF. It proposes a unified pipeline that jointly learns metric depth (via a ViT-based estimator with metric anchors), drift-free poses through multi-scale feature-based bundle adjustment, and an incremental hierarchy of local hash-grid NeRFs that covers city-block-scale scenes within limited memory. Progressive training alternates depth warm-up, pose refinement, and radiance fine-tuning, enabling metric-scale consistency and high-fidelity novel-view synthesis without external calibration. On Tanks and Temples, the method achieves an Absolute Trajectory Error (ATE) of 0.001-0.021 m, up to 18x lower than BARF and 2x lower than NoPe-NeRF, demonstrating practical, calibration-free 3D capture from a handheld RGB camera with efficient memory usage.
Abstract
Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space, leveraging learned pyramidal descriptors instead of brittle keypoints, to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on the fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error (ATE) to 0.001-0.021 m across eight indoor-outdoor sequences (up to 18x lower than BARF and 2x lower than NoPe-NeRF) while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
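
The allocation rule for local radiance fields described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how view overlap is measured, so `overlap_proxy` below is a hypothetical stand-in based on camera-centre distance, and `LocalField`, `assign_frames`, `radius`, and `min_overlap` are illustrative names and values. The sketch also assumes that "allocated and frozen" means the previous field is frozen when a new one is allocated, so that only one field is optimised at a time.

```python
import math
from dataclasses import dataclass, field


@dataclass
class LocalField:
    """Placeholder for one local hash-grid NeRF (no network is modelled here)."""
    center: tuple                      # camera position at allocation time
    frame_ids: list = field(default_factory=list)
    frozen: bool = False               # True once the field is no longer optimised


def overlap_proxy(cam_pos, fld, radius):
    """Hypothetical overlap score in [0, 1]: 1 at the field centre, 0 at `radius` or beyond."""
    dist = math.dist(cam_pos, fld.center)
    return max(0.0, 1.0 - dist / radius)


def assign_frames(cam_positions, radius=10.0, min_overlap=0.3):
    """Walk the trajectory in order, opening a new field when overlap drops below the threshold."""
    fields = []
    for i, pos in enumerate(cam_positions):
        needs_new = not fields or overlap_proxy(pos, fields[-1], radius) < min_overlap
        if needs_new:
            if fields:
                fields[-1].frozen = True   # assumption: old field is frozen, not discarded
            fields.append(LocalField(center=tuple(pos)))
        fields[-1].frame_ids.append(i)     # the frame supervises the currently active field
    return fields


if __name__ == "__main__":
    # Straight-line trajectory, one frame per metre.
    trajectory = [(float(x), 0.0, 0.0) for x in range(21)]
    for k, f in enumerate(assign_frames(trajectory, radius=10.0, min_overlap=0.25)):
        print(k, f.center, len(f.frame_ids), "frozen" if f.frozen else "active")
```

Because frozen fields need no further gradient updates, a scheme of this kind keeps the optimisation cost bounded by the active field rather than by the length of the trajectory, which is consistent with the single-GPU claim above.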
