S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models

Xinze Li, Pengxu Chen, Yiyuan Wang, Weifeng Su, Wentao Cheng

Abstract

Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce S-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at https://github.com/Powertony102/S-VGGT.
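
To make the quadratic-cost argument concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes a simplified cost model in which attention cost is proportional to the number of query-key token pairs, and it replaces the soft assignment described above with a hard, balanced split in which every subscene additionally contains one shared reference frame. The function names and the choice of four subscenes are hypothetical and chosen only for illustration.

```python
def attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for full self-attention over all tokens of all frames."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def subscene_attention_pairs(
    num_frames: int, num_subscenes: int, tokens_per_frame: int
) -> int:
    """Query-key pairs when attention runs independently inside each subscene.

    Assumption (not from the paper): one frame is the shared reference, the
    remaining frames are split as evenly as possible, and the reference frame
    is added to every subscene.
    """
    non_reference = num_frames - 1
    base, extra = divmod(non_reference, num_subscenes)
    total = 0
    for k in range(num_subscenes):
        # Subscene size: its share of frames plus the shared reference frame.
        size = base + (1 if k < extra else 0) + 1
        total += attention_pairs(size, tokens_per_frame)
    return total


if __name__ == "__main__":
    n_frames, n_subscenes, n_tokens = 500, 4, 1  # hypothetical settings
    full = attention_pairs(n_frames, n_tokens)
    split = subscene_attention_pairs(n_frames, n_subscenes, n_tokens)
    print(f"full: {full}, split: {split}, ratio: {full / split:.2f}")
```

Under this model the number of tokens per frame cancels out of the ratio, and the reduction in attention pairs approaches the number of subscenes as the sequence grows. Measured speedups will differ, since attention is only one part of the total runtime.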

Paper Structure (17 sections, 7 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Comparison of VGGT (2.69 FPS) and S-VGGT (10.13 FPS) on a 500-frame scene. S-VGGT achieves a significant speedup by processing subscenes in parallel while maintaining reconstruction quality.
  • Figure 2: The framework of S-VGGT. The input frames are first embedded into tokens, and frame similarity is calculated to assess redundancy. Frames are then grouped into subscenes via soft assignment, ensuring parallel processing. A shared reference frame across subscenes enables efficient global and frame attention operations, with the model architecture based on VGGT.
  • Figure 3: Qualitative comparison of camera pose estimation performance between S-VGGT and VGGT$^*$.
  • Figure 4: Compounded Speedup (vs. VGGT$^*$) on NRGBD. Results validate the complementarity of frame-level S-VGGT ("Ours") and token-level FastVGGT ("Fast"), showing enhanced acceleration across varying sequence lengths.
  • Figure 5: Time breakdown of VGGT$^{*}$ vs. S-VGGT.
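
The throughput figures quoted in the Figure 1 caption can be turned into a speedup factor and approximate wall-clock times with a few lines of arithmetic. This sketch assumes that FPS denotes input frames processed per second of end-to-end inference, which the caption does not state explicitly; the helper names are hypothetical.

```python
def speedup(baseline_fps: float, accelerated_fps: float) -> float:
    """Throughput ratio of the accelerated model over the baseline."""
    return accelerated_fps / baseline_fps


def total_seconds(num_frames: int, fps: float) -> float:
    """Time to process num_frames at a constant throughput of fps."""
    return num_frames / fps


if __name__ == "__main__":
    vggt_fps, svggt_fps, frames = 2.69, 10.13, 500  # values from Figure 1
    print(f"speedup: {speedup(vggt_fps, svggt_fps):.2f}x")
    print(f"VGGT:    {total_seconds(frames, vggt_fps):.1f} s")
    print(f"S-VGGT:  {total_seconds(frames, svggt_fps):.1f} s")
```

With the caption's numbers this gives a speedup of about 3.77x, or roughly 186 s versus 49 s for the 500-frame scene under the stated assumption.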