VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao

Abstract

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
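
The final "3D-to-depth" step described above, reprojecting a reconstructed point cloud onto the panorama, can be illustrated with a minimal sketch. It is not the authors' implementation. The axis convention (+z forward, +x right, +y down), the nearest-point z-buffer rule, and the names `point_to_erp` and `reproject_to_erp` are all assumptions introduced here for illustration.

```python
import math


def point_to_erp(x, y, z, width, height):
    """Map a 3D point (camera at the origin) to an ERP pixel and radial depth.

    Assumed convention (not taken from the paper): +z forward, +x right,
    +y down; longitude spans the columns, latitude spans the rows.
    """
    depth = math.hypot(x, y, z)                       # radial distance
    lon = math.atan2(x, z)                            # 0 at the forward direction
    lat = math.asin(max(-1.0, min(1.0, y / depth)))   # positive looking down
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (lat / math.pi + 0.5) * height
    # Clamp so that lon == pi or lat == pi/2 stays inside the image.
    col = min(int(u), width - 1)
    row = min(int(v), height - 1)
    return row, col, depth


def reproject_to_erp(points, width, height):
    """Render a depth map by keeping the nearest point per pixel (z-buffer)."""
    depth_map = [[math.inf] * width for _ in range(height)]
    for x, y, z in points:
        if x == 0 and y == 0 and z == 0:
            continue  # a point at the camera centre has no direction
        row, col, depth = point_to_erp(x, y, z, width, height)
        if depth < depth_map[row][col]:
            depth_map[row][col] = depth
    return depth_map


# Two points on the forward ray and one to the right of the camera.
demo = reproject_to_erp([(0, 0, 5), (0, 0, 2), (1, 0, 0)], width=8, height=4)
```

Pixels that no point reaches stay at infinity in this sketch. A practical pipeline would need some form of hole filling or denser point sampling, which this excerpt does not describe.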

Paper Structure (13 sections, 6 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Comparison between the conventional training-free panoramic depth estimation framework and our VGGT-360. Unlike view-independent inference methods (e.g., 360MD (Rey-Area et al., 2022)), VGGT-360 reconstructs a globally coherent 3D representation via VGGT-like 3D foundation models and reprojects it to the panorama, unifying fragmented per-view predictions into consistent, cross-view correlated depth with superior performance.
  • Figure 2: Framework Overview of VGGT-360. Given a panoramic image, we first perform uncertainty-guided adaptive projection to produce geometry-informative views for VGGT. With structure-saliency enhanced attention, VGGT reconstructs a structure-faithful 3D model, which is then refined by correlation-weighted 3D model correction and reprojected into a globally consistent panoramic depth map.
  • Figure 3: Pipeline of our uncertainty-guided adaptive projection. We first generate $N_\mathcal{B}$ base views from the panorama, compute per-view uncertainty maps via edge-based scoring, and select the top-$K$ most uncertain views (with $N_\mathcal{B}{=}6$, $K{=}1$ in this example). These views are then augmented with neighboring projections to form a geometry-aware multi-view set as input for VGGT.
  • Figure 4: Comparison of results before and after applying our structure-saliency enhanced attention mechanism. Guided by the structure-aware confidence map, VGGT-360 removes artifacts and preserves geometric structures in weakly structured regions, which are easily affected by illumination cues and noise.
  • Figure 5: Pipeline of our correlation-weighted 3D model correction module. Each overlapping 3D point is assigned a correlation score derived from VGGT’s intra-frame attention. These scores are then used as reliability weights to refine the reconstructed 3D model and produce the final coherent ERP depth.
  • ...and 5 more figures
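
The captions of Figures 3 and 5 describe two steps that a small sketch can make concrete: ranking base views by an edge-based uncertainty score and keeping the top-$K$, and fusing overlapping estimates with correlation scores as reliability weights. The code below is a hedged illustration, not the paper's formulation. It assumes that weaker edges mean higher uncertainty (read from "geometry-poor regions"), uses a mean finite-difference gradient as a stand-in for the edge score, and fuses scalar depths along a shared ray rather than full 3D points. All function names are hypothetical.

```python
def edge_strength(view):
    """Mean absolute finite-difference gradient of a 2D grayscale view."""
    rows, cols = len(view), len(view[0])
    total, count = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal neighbour
                total += abs(view[r][c + 1] - view[r][c])
                count += 1
            if r + 1 < rows:  # vertical neighbour
                total += abs(view[r + 1][c] - view[r][c])
                count += 1
    return total / count if count else 0.0


def select_uncertain_views(views, k):
    """Return indices of the k most uncertain views.

    Assumption: uncertainty = 1 / (1 + edge strength), so views with few
    edges rank first. The paper's actual score may differ.
    """
    scores = [1.0 / (1.0 + edge_strength(v)) for v in views]
    ranked = sorted(range(len(views)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def fuse_overlapping(depths, correlations):
    """Correlation-weighted mean of depth candidates for one overlapping point."""
    total = sum(correlations)
    if total <= 0:
        return sum(depths) / len(depths)  # fall back to a plain average
    return sum(d * w for d, w in zip(depths, correlations)) / total


checker = [[0.0, 1.0], [1.0, 0.0]]   # strong edges, low assumed uncertainty
flat = [[0.5, 0.5], [0.5, 0.5]]      # no edges, high assumed uncertainty
picked = select_uncertain_views([checker, flat], k=1)
fused = fuse_overlapping([2.0, 4.0], [3.0, 1.0])
```

In this toy input the flat view is selected for densification, and the fused depth lies closer to the candidate with the larger correlation score.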