PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery

Yijing Guo; Mengjun Chao; Luo Wang; Tianyang Zhao; Haizhao Dai; Yingliang Zhang; Jingyi Yu; Yujiao Shi

Abstract

Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
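The three-axis SO(3) rotation augmentation mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the equirectangular convention (longitude along the image width, latitude along the height, y-axis up), the yaw-pitch-roll composition order, and all function names are assumptions chosen for illustration. It only shows how a pixel of an equirectangular panorama maps to a unit direction, is rotated, and maps back.

```python
import math

def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel to a unit direction (assumed convention, y up)."""
    lon = (u / width - 0.5) * 2.0 * math.pi   # longitude in [-pi, pi)
    lat = (0.5 - v / height) * math.pi        # latitude in [-pi/2, pi/2]
    return [math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon)]

def direction_to_pixel(d, width, height):
    """Inverse of pixel_to_direction, wrapping the horizontal coordinate."""
    x, y, z = d
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y)))
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u % width, v

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def rotation_matrix(yaw, pitch, roll):
    """Compose R = Ry(yaw) @ Rx(pitch) @ Rz(roll); the order is an assumption."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul(ry, matmul(rx, rz))

def rotate_pixel(u, v, width, height, rot):
    """Send one pixel through a rotation of the viewing sphere."""
    d = pixel_to_direction(u, v, width, height)
    rd = [sum(rot[i][k] * d[k] for k in range(3)) for i in range(3)]
    return direction_to_pixel(rd, width, height)
```

Under this convention a pure yaw is a horizontal shift of the panorama, while pitch and roll warp it non-trivially. In an actual augmentation pipeline the image would be resampled over all pixels and the pose and geometry labels rotated consistently; both steps are omitted here.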

Paper Structure

This paper contains 22 sections, 11 equations, 6 figures, and 10 tables.

Introduction
Related Work
Dataset - PanoCity
Method - PanoVGGT
Framework Overview
Spherical-aware Position Embedding
Geometry Aggregator
Panorama-Specific Data Augmentation
Loss Functions
Experiments
Camera Pose Estimation
Monocular Depth Estimation
Point Cloud Reconstruction
Ablation Studies
Conclusion
...and 7 more sections

Figures (6)

Figure 1: Overview of PanoVGGT and the PanoCity dataset. PanoVGGT predicts camera poses, dense depth, and consistent 3D point clouds from unordered panoramas. Compared with perspective-based pipelines that split panoramas into pinhole views, it yields more accurate 3D reconstructions. PanoCity contains over 120k outdoor panoramas under diverse urban scenes and weather conditions.
Figure 2: Overview of the proposed PanoVGGT framework. Given one or multiple panoramic images, PanoVGGT extracts spherical patch tokens using a ViT backbone, augments them via $SO(3)$ geometric augmentation, and encodes them with shared and branch-adapted positional embeddings. A multi-branch attention aggregator jointly reasons about local geometry, global structure, and camera motion. The model outputs dense depth maps, camera poses (under a randomly selected global anchor), and local/global point maps for final 3D reconstruction.
Figure 3: Multi-view point-cloud reconstructions on the Matterport3D dataset from two unordered panoramic inputs. From left to right: $\pi^3$, $\pi^3$*, $\pi^3{}^\dagger$, and PanoVGGT. Here, $\dagger$ denotes the original pinhole-only $\pi^3$ applied to panoramic inputs via MoGe's dodecahedral projection protocol, which decomposes each panorama into 12 perspective views. PanoVGGT produces sharper and more structurally consistent indoor reconstructions.
Figure 4: Multi-view point-cloud reconstructions on the Stanford2D3D dataset using two unordered panoramic inputs. From left to right: $\pi^3$, $\pi^3$*, $\pi^3{}^\dagger$, and PanoVGGT. Our method achieves higher geometric accuracy and cross-view consistency than the baselines.
Figure 5: Multi-view point-cloud reconstructions on the PanoCity dataset using ten unordered panoramic inputs. From left to right: $\pi^3$, $\pi^3$*, $\pi^3{}^\dagger$, and PanoVGGT. The baseline methods struggle to learn accurate geometry on this long-trajectory setup, whereas PanoVGGT reconstructs large-scale outdoor scenes with coherent structure and accurate alignment across views.
...and 1 more figure
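The Figure 2 caption notes that camera poses are predicted under a randomly selected global anchor. As a rough illustration of what re-expressing poses in a randomly chosen anchor frame could look like, here is a minimal sketch. It assumes camera-to-world poses given as (rotation, translation) pairs and that the anchor is one of the input views; the function names are hypothetical and the paper's actual anchoring strategy may differ.

```python
import random

def transpose(m):
    """Transpose a 3x3 matrix; for a rotation this is also its inverse."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def reanchor(poses, anchor_idx):
    """Express every camera-to-world pose (R, t) in the anchor camera's frame.

    Computes T_a^{-1} T_i, i.e. R' = R_a^T R_i and t' = R_a^T (t_i - t_a).
    """
    r_a, t_a = poses[anchor_idx]
    r_a_inv = transpose(r_a)
    out = []
    for r_i, t_i in poses:
        r_rel = matmul(r_a_inv, r_i)
        t_rel = matvec(r_a_inv, [ti - ta for ti, ta in zip(t_i, t_a)])
        out.append((r_rel, t_rel))
    return out

def random_reanchor(poses, rng=random):
    """Pick an anchor view uniformly at random (assumed sampling rule) and re-anchor."""
    idx = rng.randrange(len(poses))
    return idx, reanchor(poses, idx)
```

After re-anchoring, the anchor view has identity rotation and zero translation, and all other poses are relative to it, so no single input view is privileged as the fixed reference.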
