Emergent Extreme-View Geometry in 3D Foundation Models

Yiwen Zhang, Joseph Tung, Ruojin Cai, David Fouhey, Hadar Averbuch-Elor

TL;DR

This work reveals that 3D foundation models inherently encode a 3D language enabling reasoning under extreme, non-overlapping viewpoints. It introduces a lightweight rotation-based alignment that updates only backbone biases (about 80k parameters) while keeping all decoder heads frozen, yielding substantial gains in extreme-view relative rotation estimation without harming per-image depth or point quality. A new benchmark, MegaUnScene, provides unseen internet scenes to evaluate both relative pose and dense reconstruction in unconstrained settings, and the method advances state-of-the-art rotation performance across multiple 3DFMs. The results suggest that minimal, targeted fine-tuning can unlock more robust extreme-view 3D reasoning in large-scale 3D foundation models, with potential relevance to navigation, AR/VR, and large-scale scene understanding.

Abstract

3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.
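The bias-only adaptation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names, the `backbone.blocks.` prefix, the layer indices, and the toy shapes are all hypothetical, chosen only to show how one might select bias terms in a sparse set of backbone layers while leaving every weight matrix and every decoder head frozen. In a framework such as PyTorch, the selected names would be the only parameters left with `requires_grad` enabled.

```python
def select_trainable_biases(param_shapes, backbone_prefix, tuned_layers):
    """Return names of bias parameters inside the chosen backbone layers.

    Assumes a hypothetical naming scheme "<prefix><layer index>.<submodule>.bias".
    """
    selected = []
    for name in param_shapes:
        # Skip anything outside the backbone (e.g. decoder heads) and all weights.
        if not name.startswith(backbone_prefix) or not name.endswith(".bias"):
            continue
        layer_token = name[len(backbone_prefix):].split(".")[0]
        if not layer_token.isdigit():
            continue
        if int(layer_token) in tuned_layers:
            selected.append(name)
    return selected


def count_params(param_shapes, names):
    """Total number of scalar parameters across the given names."""
    total = 0
    for name in names:
        size = 1
        for dim in param_shapes[name]:
            size *= dim
        total += size
    return total


# Toy parameter table (hypothetical names and shapes, not a real 3DFM).
param_shapes = {
    "backbone.blocks.0.attn.qkv.weight": (3072, 1024),
    "backbone.blocks.0.attn.qkv.bias": (3072,),
    "backbone.blocks.1.attn.qkv.weight": (3072, 1024),
    "backbone.blocks.1.attn.qkv.bias": (3072,),
    "backbone.blocks.1.mlp.fc1.bias": (4096,),
    "camera_head.fc.weight": (9, 1024),
    "camera_head.fc.bias": (9,),
}

trainable = select_trainable_biases(param_shapes, "backbone.blocks.", {1})
num_trainable = count_params(param_shapes, trainable)
```

Because only bias vectors are selected, the trainable count grows with layer width rather than with width squared, which is why such a scheme can stay very small relative to the full model.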

Paper Structure

This paper contains 28 sections, 9 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Do 3D foundation models have an emergent understanding of extreme views? The pretrained VGGT model was trained primarily on overlapping images. Surprisingly, when tested on non-overlapping image pairs$^*$, the model still produces plausible estimates of relative pose, with nearly half of the pairs yielding a rotation error below $30^\circ$. Careful fine-tuning of a small number of parameters substantially improves results. Here, for instance, the pretrained model produces an incorrect pose (yielding the red ghost structure); the fine-tuned model corrects the error. $^*$All from unseen scenes assembled in our MegaUnScene benchmark.
  • Figure 2: VGGT Cross-View Attention Maps. We visualize cross-view attention maps for three image pairs of varying overlap, from high overlap (left) to none (right). For each image pair, we highlight three query regions in $I_1$ with colored boxes (green, cyan, magenta). Solid boxes indicate region overlap, while dashed boxes indicate no overlap. $I_2$'s corresponding attention maps are shown at the bottom row with like colors. Reconstructed pointmaps are shown towards the right.
  • Figure 3: Rotation-based Alignment Framework. Above we illustrate our lightweight alignment scheme, which supervises the camera head of 3DFMs via a rotation loss $\mathcal{L}$ over predicted and ground truth relative rotation matrices. To preserve pretrained knowledge, we only update the bias terms of a sparse set of layers in the shared alternating attention (AA) backbone.
  • Figure 4: Metric scale visualization of UnSceneRecon scenes (L→R): Aghavnavank Monastery, Alamgiri Gate, Predjama Castle, and the Ritz Tower. For reference, the person is 2 meters tall.
  • Figure 5: Qualitative results over UnScenePairs-t. From left to right, we show the input image pair, spherical projection of relative rotations (black: reference view, blue: ground truth, red: pretrained VGGT, yellow: fine-tuned VGGT), and the corresponding reconstructions (sparse ground truth, dense pretrained and fine-tuned). Please refer to the supplementary material for many additional visualizations.
  • ...and 5 more figures
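
The captions above refer to a rotation error in degrees (Figure 1) and a rotation loss over predicted and ground-truth relative rotation matrices (Figure 3). The exact form of $\mathcal{L}$ is not given in this excerpt, so the following is only a sketch of one common choice, the geodesic angle between two rotations, $\theta = \arccos\big((\mathrm{tr}(R_{\text{pred}}^\top R_{\text{gt}}) - 1)/2\big)$. The helper names and the $30^\circ$ threshold function are illustrative assumptions.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose of a 3x3 matrix given as nested lists."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_y(deg):
    """Rotation by `deg` degrees about the y axis (used only to build examples)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def geodesic_error_deg(r_pred, r_gt):
    """Angle in degrees of the residual rotation r_pred^T r_gt."""
    residual = matmul3(transpose3(r_pred), r_gt)
    trace = residual[0][0] + residual[1][1] + residual[2][2]
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def fraction_below(errors_deg, threshold_deg=30.0):
    """Share of pairs whose rotation error is under the threshold."""
    return sum(e < threshold_deg for e in errors_deg) / len(errors_deg)


# Example: prediction of 100 degrees about y against a ground truth of 120 degrees.
example_error = geodesic_error_deg(rot_y(100.0), rot_y(120.0))
```

A statistic such as "nearly half of the pairs yielding a rotation error below $30^\circ$" would then correspond to `fraction_below(errors)` evaluated over all test pairs.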