Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video

Zihui Gao, Ke Liu, Donny Y. Chen, Duochao Shi, Guosheng Lin, Hao Chen, Chunhua Shen

TL;DR

3D geometric foundation models are constrained by the scarcity of diverse 3D annotations. SAGE enables scalable adaptation from unlabelled Internet video by combining spatio-temporal trajectory mining, sparse COLMAP anchors, dense differentiable 3D Gaussian consistency, and anchor-based regularization to prevent forgetting. It reduces Chamfer Distance by 20–42% on unseen benchmarks and shows strong zero-shot generalization as video data scales to 10K scenes, validating Internet video as a scalable resource for general-purpose 3D learning. The approach also improves pose estimation and robustness on outdoor scenes, highlighting its practical impact on 3D understanding in real-world settings.

Abstract

Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20–42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
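
Since the headline result is stated in terms of Chamfer Distance, a minimal sketch of that metric may help readers interpret the numbers. The version below is an assumption, not the paper's evaluation code: it averages the mean nearest-neighbor distance in both directions (often called accuracy and completeness), uses unsquared Euclidean distance, and relies on a brute-force search that is only practical for small clouds. Conventions differ across papers (squared distances, sum instead of mean), and the function names here are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _nearest_dist(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from p to its nearest neighbor in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance between two point clouds.

    Assumed convention: the mean of the two directed average
    nearest-neighbor distances (pred -> gt and gt -> pred).
    """
    if not pred or not gt:
        raise ValueError("both point clouds must be non-empty")
    # pred -> gt: how close predicted points lie to the reference surface.
    accuracy = sum(_nearest_dist(p, gt) for p in pred) / len(pred)
    # gt -> pred: how well the prediction covers the reference surface.
    completeness = sum(_nearest_dist(q, pred) for q in gt) / len(gt)
    return 0.5 * (accuracy + completeness)
```

Under this convention, a lower value is better, and a prediction that misses part of the scene is penalized through the completeness term even if every predicted point is accurate.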

Paper Structure (52 sections, 6 equations, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Overview of the SAGE framework for scaling a General-Purpose 3D Foundation Model. To overcome the Data Scarcity Bottleneck inherent in limited labeled 3D datasets, we propose a pipeline that leverages unlimited Internet videos to achieve robust zero-shot generalization on complex unseen scenes. Bottom right: Scalability analysis demonstrates zero-shot reconstruction performance as training data scales from 100 to 10K video scenes, showing consistent improvement with increased data volume across various benchmarks (MP3D, 7Scenes, and TUM).
  • Figure 2: Illustration of SAGE. For each video sequence, we sample context frames as model inputs and designate target frames for novel-view supervision, providing photometric constraints to refine the reconstructed 3D point cloud. Furthermore, we incorporate sparse 3D point clouds that provide consistent geometric constraints, complementing the photometric supervision from the target views.
  • Figure 3: Empirical analysis of training data difficulty and its impact on model generalization. (Left) Training distribution across three datasets. (Right) Generalization performance (CD ↓) on three test sets under various mixing ratios of Re10K to DL3DV samples. Subplots indicate that while pure medium-hardness data (1:0) benefits simple scenes, a balanced mixture (1:1) yields the most robust generalization on large-scale environments like MP3D. The dashed red lines denote the pre-trained baseline performance.
  • Figure 4: Qualitative comparison of reconstructed point clouds across different methods, alongside ground-truth geometry. Zoomed-in views of representative regions are shown on the right. (Note: The 7Scenes ground-truth may appear slightly misaligned due to minor pose inaccuracies in the dataset.)
  • Figure 5: Standard 3D GFM Training and Inference Framework. A generic 3D GFM consists of an image encoder, a decoder for cross-view feature interaction, and optional output heads that regress various geometric representations. These models are commonly trained with explicit 3D supervision.
  • ...and 6 more figures
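
The context/target split described in the Figure 2 caption can be illustrated with a small sketch. This is an assumption for illustration only: the listing does not say how SAGE samples frames, so the evenly spaced context frames, the random choice of in-between target frames, and the name `split_context_target` are all hypothetical.

```python
import random
from typing import List, Tuple


def split_context_target(
    num_frames: int, num_context: int, num_targets: int, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split frame indices of one trajectory into context and target sets.

    Context frames (model inputs) are spread evenly over the trajectory;
    target frames (novel-view supervision) are drawn from the remaining
    frames. The sampling scheme is an illustrative assumption.
    """
    if num_context < 2:
        raise ValueError("need at least two context frames")
    if num_context + num_targets > num_frames:
        raise ValueError("not enough frames for the requested split")
    # Integer arithmetic keeps indices distinct and includes both endpoints.
    context = [i * (num_frames - 1) // (num_context - 1) for i in range(num_context)]
    taken = set(context)
    remaining = [i for i in range(num_frames) if i not in taken]
    rng = random.Random(seed)  # seeded for reproducible splits
    targets = sorted(rng.sample(remaining, num_targets))
    return context, targets
```

For a 10-frame trajectory with 4 context frames, this picks indices 0, 3, 6, and 9 as inputs, so every target frame lies between two context views and the rendered target can be compared photometrically against the held-out image.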

Theorems & Definitions (1)

  • Definition 3.1: pose-free sparse-view 3D reconstruction