VGGT-X: When VGGT Meets Dense Novel View Synthesis

Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang

TL;DR

VGGT-X tackles the challenge of scaling 3D Foundation Models to dense novel view synthesis by addressing two principal barriers: excessive VRAM demands and degraded outputs from 3DFMs when used for dense views. It introduces a memory-efficient VGGT, an adaptive Global Alignment module, and robust 3DGS training (including MCMC-3DGS and joint pose optimization) to enable COLMAP-free dense NVS with high fidelity. The approach substantially narrows the fidelity gap to COLMAP-initialized pipelines and achieves state-of-the-art results in both dense NVS and pose estimation across multiple datasets, while offering insights into residual gaps and future improvements for 3D foundation models. These findings support faster, scalable, and more reliable dense NVS systems, with practical impact for online rendering and large-scale scene understanding.

Abstract

We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Reconstruction and Novel View Synthesis results. In part (a), we extend VGGT to handle dense multi-view inputs and incorporate an efficient global alignment, yielding highly accurate predictions. Part (b) demonstrates that eliminating redundant VRAM usage enables inference on more than 1,000 images without compromising performance. Here, VGGT$-$ denotes VGGT with redundant intermediate features eliminated. Finally, part (c) illustrates that, with an appropriate joint pose and 3DGS optimization strategy, photorealistic rendering can be achieved.
  • Figure 2: Overall pipeline of our model.
  • Figure 3: Qualitative comparison of rendering results. 3DGS$^\dagger$ denotes 3DGS trained with COLMAP initialization and is shown mainly for reference. Apple is from the CO3Dv2 dataset, Garden and Stump are from the MipNeRF360 dataset, and Ignatius and Caterpillar are from the TnT dataset.
  • Figure 4: Qualitative comparison of estimated trajectories. We also report the Root Mean Square Error (RMSE) of the Absolute Trajectory Error (ATE), in meters (Matsuki et al., 2024). The color bar indicates trajectory distance. We recommend zooming in for finer details.
  • Figure 5: Bad case analysis. The blue and red histograms show the distributions of rotation and translation residuals, respectively. The right part shows blurry artifacts caused by inaccurate poses.
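
For readers unfamiliar with the ATE RMSE metric reported in Figure 4, the following is a minimal sketch of how such a value is commonly computed. It is not the authors' evaluation code. It assumes the estimated and ground-truth camera positions are already expressed in the same coordinate frame (in practice a rigid or similarity alignment is usually applied first, which is omitted here), and the function name `ate_rmse` and the sample trajectories are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of per-camera position errors between two trajectories.

    Assumption: both trajectories are already aligned to a common frame,
    so no rotation, translation, or scale is estimated here.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance between each pair of camera centers.
    squared_errors = [math.dist(e, g) ** 2 for e, g in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical camera centers in meters, for illustration only.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.0, 0.2)]
print(ate_rmse(est, gt))
```

Because the errors are squared before averaging, a few badly estimated poses raise the RMSE more than they would raise a mean error.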