PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery

Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, Yujiao Shi

Abstract

Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
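
The three-axis SO(3) rotation augmentation mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an equirectangular panorama with the y axis pointing up, a Z-Y-X composition of the three axis rotations, and nearest-neighbour resampling. The function names `pixel_to_dir`, `dir_to_pixel`, `rotation_xyz`, and `rotate_panorama` are hypothetical. In a real training pipeline the ground-truth poses and point maps would need to be rotated consistently with the image, which this sketch does not cover.

```python
import math

def pixel_to_dir(u, v, width, height):
    """Map an equirectangular pixel centre to a unit direction (assumed convention: y up)."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # longitude in (-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # latitude in (-pi/2, pi/2)
    return (math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon))

def dir_to_pixel(d, width, height):
    """Inverse of pixel_to_dir; returns continuous pixel coordinates."""
    x, y, z = d
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y)))
    u = (lon + math.pi) / (2.0 * math.pi) * width - 0.5
    v = (math.pi / 2.0 - lat) / math.pi * height - 0.5
    return u, v

def rotation_xyz(ax, ay, az):
    """Rotation matrix R = Rz(az) @ Ry(ay) @ Rx(ax), angles in radians."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def rotate_panorama(img, R):
    """Resample a panorama (list of rows) so that scene content is rotated by R."""
    height, width = len(img), len(img[0])
    out = [[None] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            d = pixel_to_dir(u, v, width, height)
            # Inverse warp: the source direction is R^T applied to the output direction.
            src = tuple(sum(R[k][i] * d[k] for k in range(3)) for i in range(3))
            su, sv = dir_to_pixel(src, width, height)
            row = min(height - 1, max(0, round(sv)))   # clamp at the poles
            col = round(su) % width                    # wrap around in longitude
            out[v][u] = img[row][col]
    return out
```

Under these conventions, a rotation about the vertical axis reduces to a circular shift of image columns, while rotations about the other two axes move content across latitudes and therefore change the equirectangular distortion pattern. That is presumably why all three axes are useful for augmentation.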

Paper Structure (22 sections, 11 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Overview of PanoVGGT and the PanoCity dataset. PanoVGGT predicts camera poses, dense depth, and consistent 3D point clouds from unordered panoramas. Compared with perspective-based pipelines that split panoramas into pinhole views, it yields more accurate 3D reconstructions. PanoCity contains over 120k outdoor panoramas covering diverse urban scenes and weather conditions.
  • Figure 2: Overview of the proposed PanoVGGT framework. Given one or multiple panoramic images, PanoVGGT extracts spherical patch tokens using a ViT backbone, augments them via $SO(3)$ geometric augmentation, and encodes them with shared and branch-adapted positional embeddings. A multi-branch attention aggregator jointly reasons about local geometry, global structure, and camera motion. The model outputs dense depth maps, camera poses (under a randomly selected global anchor), and local/global point maps for final 3D reconstruction.
  • Figure 3: Multi-view point-cloud reconstructions on the Matterport3D dataset from two unordered panoramic inputs. From left to right: $\pi^3$, $\pi^3$*, $\pi^3{}^\dagger$, and PanoVGGT. Here, $\dagger$ denotes the original pinhole-only $\pi^3$ applied to panoramic inputs via MoGe's dodecahedral projection protocol, which decomposes each panorama into 12 perspective views. PanoVGGT produces sharper and more structurally consistent indoor reconstructions.
  • Figure 4: Multi-view point-cloud reconstructions on the Stanford2D3D dataset [armeni2017joint] using two unordered panoramic inputs. From left to right: $\pi^3$ [wang2025pi3], $\pi^3$*, $\pi^3{}^\dagger$, and PanoVGGT. Our method achieves higher geometric accuracy and cross-view consistency than the baselines.
  • Figure 5: Multi-view point-cloud reconstructions on the PanoCity dataset using ten unordered panoramic inputs. From left to right: $\pi^3$ [wang2025pi3], $\pi^3$*, $\pi^3{}^\dagger$, and PanoVGGT. The baseline methods struggle to learn accurate geometry on this long-trajectory setup, whereas PanoVGGT reconstructs large-scale outdoor scenes with coherent structure and accurate alignment across views.
  • ...and 1 more figures
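
The Figure 2 caption mentions camera poses predicted "under a randomly selected global anchor". One plausible reading, sketched below, is that the ground-truth poses of a training sample are re-expressed in the frame of a randomly chosen view. This is an assumption, not a description of the paper's actual procedure. The helper names `reanchor` and `stochastic_anchor` are hypothetical, and poses are assumed to be camera-to-world pairs `(R, t)` with `R` a 3x3 rotation given as nested lists.

```python
import random

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def reanchor(poses, anchor):
    """Express camera-to-world poses (R, t) in the frame of poses[anchor].

    Computes T_anchor^{-1} @ T_i for every view, so the anchor view ends up
    with identity rotation and zero translation.
    """
    R_a, t_a = poses[anchor]
    R_a_T = transpose(R_a)  # the inverse of a rotation is its transpose
    out = []
    for R, t in poses:
        R_rel = matmul(R_a_T, R)
        t_rel = matvec(R_a_T, [t[i] - t_a[i] for i in range(3)])
        out.append((R_rel, t_rel))
    return out

def stochastic_anchor(poses, rng=random):
    """Pick a random anchor view (assumed uniform) and re-express all poses relative to it."""
    anchor = rng.randrange(len(poses))
    return anchor, reanchor(poses, anchor)
```

A call such as `stochastic_anchor(poses, random.Random(0))` returns the chosen index together with the re-anchored poses. Because any view can serve as the reference, no fixed input position is privileged, which fits the unordered-input setting described in the captions.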