VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors

Yuhang Ming, Tingkang Xi, Xingrui Yang, Lixin Yang, Yong Peng, Cewu Lu, Wanzeng Kong

Abstract

Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable, generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFM-Recon, the first attempt to bridge transferable VFM priors with the scale-consistency requirements of scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multi-view scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of the pretrained representations. We train our model on the ScanNet train split and evaluate on both the in-distribution ScanNet test split and the out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all dataset domains. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed mesh evaluation, substantially outperforming the closest competitor, VGGT, which attains only 51.8.
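For readers unfamiliar with the F1 score used in mesh evaluation, the following is a minimal sketch of how such a score is commonly computed from points sampled on the predicted and ground-truth surfaces. The function name `f_score`, the brute-force nearest-neighbour search, and the 5 cm threshold are illustrative assumptions; the paper's exact evaluation protocol is not stated in this section. The sketch returns values in [0, 1], whereas the abstract appears to report the score as a percentage.

```python
import math


def f_score(pred_points, gt_points, threshold=0.05):
    """Return (precision, recall, F-score) for two 3D point sets.

    precision: fraction of predicted points within `threshold` of some GT point.
    recall:    fraction of GT points within `threshold` of some predicted point.
    The threshold value is an assumption, not taken from the paper.
    """

    def fraction_within(src, dst):
        # Brute-force nearest neighbour; fine for small illustrative inputs.
        if not src or not dst:
            return 0.0
        hits = sum(
            1 for p in src
            if min(math.dist(p, q) for q in dst) <= threshold
        )
        return hits / len(src)

    precision = fraction_within(pred_points, gt_points)
    recall = fraction_within(gt_points, pred_points)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: two of three predicted points lie near the ground truth.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.01), (5.0, 0.0, 0.0)]
precision, recall, f1 = f_score(pred, gt)
```

Because the score is a harmonic mean, a reconstruction has to be both accurate (high precision) and complete (high recall) to score well, which is why fragmented or mis-scaled meshes are penalized heavily.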

Paper Structure

This paper contains 14 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We present VFM-Recon, a novel framework that enables cross-domain, scale-consistent neural volumetric reconstruction. Existing neural reconstruction methods such as FineRecon [finerecon2023iccv] perform well on indoor scenes but struggle under domain shift, while VFMs such as VGGT [vggt2025cvpr] predict plausible point maps yet lack the scale consistency needed for mesh reconstruction. The radar plot highlights consistent performance gains across ScanNet [scannet2017cvpr], TUM RGB-D [tumrgbd2012iros], and Tanks and Temples [tat2017tog].
  • Figure 2: System Overview. At the top is our VFM-augmented volumetric reconstruction network, and at the bottom is our lightweight scale alignment module. Our system divides the sequence into overlapping submaps and uses VGGT [vggt2025cvpr] to predict scale-ambiguous depths. It then recovers per-submap scales via feature matching and triangulation, followed by factor-graph global scale optimization. The aligned depths are further refined with a fuse-and-back-projection process and are used as geometric priors for the subsequent VFM-augmented neural reconstruction.
  • Figure 3: Network Architecture. Details of our adapted VGGT for the VFM-augmented volumetric reconstruction network, in which only the MLP adapter is trainable.
  • Figure 4: Qualitative Comparison on ScanNet [scannet2017cvpr]. Compared with NeuralRecon [neuralrecon2021cvpr] and FineRecon [finerecon2023iccv], our VFM-Recon reconstructs finer structural details such as staircases, chair and table legs, and TV surfaces, as highlighted by the red bounding boxes.
  • Figure 5: Qualitative Comparison on Tanks and Temples [tat2017tog]. Compared with FineRecon [finerecon2023iccv] and VGGT [vggt2025cvpr], our VFM-Recon reconstructs more complete structures with fewer fragmented artifacts across challenging outdoor scenes.
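
The scale alignment steps described in the Figure 2 caption (per-submap scale recovery followed by global scale optimization) can be illustrated with a simplified sketch. It assumes that each submap scale is estimated as the median ratio between triangulated depths and predicted depths at matched pixels, and it replaces the paper's factor-graph optimization with a plain log-domain least-squares solve using coordinate-descent sweeps. The function names, the median-ratio estimator, and the solver are hypothetical stand-ins, not the authors' implementation.

```python
import math
from statistics import median


def submap_scale(predicted_depths, triangulated_depths):
    """Estimate s such that s * predicted depth ~= triangulated depth.

    Uses the median ratio for robustness to outlier matches (an assumption).
    """
    ratios = [
        t / p
        for p, t in zip(predicted_depths, triangulated_depths)
        if p > 0 and t > 0
    ]
    if not ratios:
        raise ValueError("no valid depth pairs")
    return median(ratios)


def optimize_global_scales(num_submaps, relative_scales, iterations=200):
    """Solve for one scale per submap from pairwise relative scales.

    relative_scales: list of (i, j, r) meaning scale_j ~= r * scale_i.
    Minimizes sum((log s_j - log s_i - log r)^2) with s_0 fixed to 1,
    assuming every submap is connected to submap 0 through overlaps.
    """
    log_s = [0.0] * num_submaps
    for _ in range(iterations):
        for k in range(1, num_submaps):  # submap 0 is the fixed anchor
            targets = []
            for i, j, r in relative_scales:
                if j == k:
                    targets.append(log_s[i] + math.log(r))
                elif i == k:
                    targets.append(log_s[j] - math.log(r))
            if targets:
                # Exact minimizer of the objective with respect to log_s[k].
                log_s[k] = sum(targets) / len(targets)
    return [math.exp(v) for v in log_s]


# Toy example: three overlapping submaps with consistent relative scales.
s_local = submap_scale([1.0, 2.0, 4.0], [2.0, 4.1, 7.9])
scales = optimize_global_scales(3, [(0, 1, 2.0), (1, 2, 0.5), (0, 2, 1.0)])
```

Working in the log domain turns the multiplicative scale constraints into additive ones, so the problem becomes an ordinary linear least-squares problem; fixing the first submap's scale removes the remaining global ambiguity.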