Emergent Extreme-View Geometry in 3D Foundation Models
Yiwen Zhang, Joseph Tung, Ruojin Cai, David Fouhey, Hadar Averbuch-Elor
TL;DR
This work shows that 3D foundation models (3DFMs) carry an internal 3D representation that supports reasoning under extreme, non-overlapping viewpoints, even though they were never trained for such conditions. It introduces a lightweight rotation-based alignment that updates only backbone bias terms (about 80k parameters) while keeping all decoder heads frozen, yielding substantial gains in extreme-view relative rotation estimation without degrading per-image depth or point quality. A new benchmark, MegaUnScene, provides Internet scenes unseen by existing 3DFMs to evaluate both relative pose estimation and dense 3D reconstruction in unconstrained settings, and the method advances state-of-the-art rotation performance across multiple 3DFMs. The results indicate that minimal, targeted fine-tuning can unlock robust extreme-view reasoning in large-scale 3D vision models, with potential applications in navigation, AR/VR, and large-scale scene understanding.
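For readers unfamiliar with how relative rotation estimates are usually scored, the following is a minimal sketch of the standard geodesic angular error between a predicted and a ground-truth rotation. The evaluation protocol is not specified in this summary, so the metric choice, the world-to-camera convention, and all function names here are assumptions made for illustration only.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def relative_rotation(r1, r2):
    """Rotation from camera 1 to camera 2.

    Assumes r1 and r2 are world-to-camera rotations (an assumption,
    not a convention stated in the text).
    """
    return matmul3(r2, transpose3(r1))


def geodesic_error_deg(r_pred, r_gt):
    """Angle in degrees of the residual rotation r_pred^T @ r_gt."""
    delta = matmul3(transpose3(r_pred), r_gt)
    trace = delta[0][0] + delta[1][1] + delta[2][2]
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def rot_y(deg):
    """Helper for the example: rotation about the y-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


if __name__ == "__main__":
    # Two cameras whose viewing directions differ by 120 degrees,
    # a wide-baseline pair with little or no visual overlap.
    r_gt = relative_rotation(rot_y(30), rot_y(150))
    r_pred = rot_y(110)  # hypothetical prediction
    print(round(geodesic_error_deg(r_pred, r_gt), 3))
```

The error is the rotation angle of the residual between prediction and ground truth, so it is independent of the choice of rotation axis and always lies between 0 and 180 degrees.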
Abstract
3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.
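To make the idea of tuning only backbone bias terms concrete, here is a framework-free sketch of how trainable parameters could be selected by name. It assumes a hypothetical naming scheme in which backbone parameters start with `backbone.` and bias vectors end with `.bias`; the parameter names and shapes are invented for illustration, and the sketch selects every backbone bias, whereas the abstract only says a small subset is tuned without specifying which.

```python
def select_backbone_biases(param_shapes, backbone_prefix="backbone."):
    """Return the parameters that would be left trainable.

    `param_shapes` maps parameter name -> shape tuple. Everything not
    returned here (backbone weights, all decoder heads) stays frozen.
    The prefix and the ".bias" suffix are assumed naming conventions.
    """
    return {
        name: shape
        for name, shape in param_shapes.items()
        if name.startswith(backbone_prefix) and name.endswith(".bias")
    }


def count_elements(param_shapes):
    """Total number of scalar parameters across the given shapes."""
    total = 0
    for shape in param_shapes.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


if __name__ == "__main__":
    # Toy parameter inventory; names and sizes are hypothetical.
    toy_model = {
        "backbone.blocks.0.attn.qkv.weight": (3072, 1024),
        "backbone.blocks.0.attn.qkv.bias": (3072,),
        "backbone.blocks.0.mlp.fc1.weight": (4096, 1024),
        "backbone.blocks.0.mlp.fc1.bias": (4096,),
        "camera_head.fc.weight": (9, 1024),
        "camera_head.fc.bias": (9,),
        "depth_head.conv.bias": (1,),
    }
    trainable = select_backbone_biases(toy_model)
    print(sorted(trainable))
    print(count_elements(trainable), "of", count_elements(toy_model))
```

In a deep learning framework, the same name filter would typically decide which parameters receive gradients, so the decoder heads that produce depth, pose, and point maps remain exactly as pretrained.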
