OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer

Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu

TL;DR

OmniVGGT addresses the RGB-only input assumption of most 3D foundation models by accepting any subset of auxiliary geometric modalities (depth maps, camera intrinsics, and poses) during both training and inference. It introduces a GeoAdapter that hierarchically injects depth and camera parameters through zero-initialized convolutions, which preserves the stability of the pretrained representation, and pairs it with a stochastic multimodal fusion strategy that randomly samples modality subsets during training to learn robust spatial features. The approach achieves state-of-the-art results on monocular and multi-view depth estimation, multi-view stereo, and camera pose estimation, and extends to vision-language-action (VLA) robotics, where depth and pose cues improve spatial reasoning and manipulation. Inference speed stays comparable to VGGT even with additional inputs, and the paper includes ablations and cross-domain evaluations.
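To make the role of zero initialization concrete, here is a minimal sketch of a residual injection whose projection starts at zero. It is not the paper's GeoAdapter: the function names (`make_zero_conv`, `apply_conv`, `inject`), the token dimensions, and the use of plain Python lists in place of tensors are all assumptions for illustration. A 1x1 convolution is modeled as a per-token linear map.

```python
import random

def make_zero_conv(in_dim, out_dim):
    """Return a 1x1 convolution (a per-token linear map) with all-zero weights and bias."""
    return {
        "weight": [[0.0] * in_dim for _ in range(out_dim)],
        "bias": [0.0] * out_dim,
    }

def apply_conv(conv, token):
    """Project one geometric token from in_dim to out_dim channels."""
    return [
        sum(w * x for w, x in zip(row, token)) + b
        for row, b in zip(conv["weight"], conv["bias"])
    ]

def inject(backbone_tokens, geo_tokens, conv):
    """Add the projected geometric feature to each backbone token (residual injection)."""
    return [
        [t + d for t, d in zip(tok, apply_conv(conv, geo))]
        for tok, geo in zip(backbone_tokens, geo_tokens)
    ]

# Hypothetical toy data: 3 tokens, 4 backbone channels, 2 geometric channels.
rng = random.Random(0)
backbone_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
geo_tokens = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]

conv = make_zero_conv(in_dim=2, out_dim=4)
out_at_init = inject(backbone_tokens, geo_tokens, conv)

# Simulate one training update: a single weight moves away from zero.
conv["weight"][0][0] = 0.5
out_after_update = inject(backbone_tokens, geo_tokens, conv)
```

At initialization the injected term is exactly zero, so `out_at_init` equals `backbone_tokens` and the pretrained features pass through unchanged. Once a weight moves away from zero, only the channel it feeds starts to carry geometric information.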

Abstract

General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrate OmniVGGT into vision-language-action (VLA) models. The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
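The per-instance modality sampling described in the abstract can be sketched as follows. This is an illustrative assumption, not the authors' training code: the sampling rule (independent keep probability of 0.5), the names `sample_modalities` and `build_inputs`, and the string `PLACEHOLDER` standing in for learned placeholder tokens are all hypothetical.

```python
import random

MODALITIES = ("depth", "camera")   # auxiliary inputs named in the abstract
PLACEHOLDER = "<placeholder>"      # stand-in for a learned placeholder token

def sample_modalities(rng, keep_prob=0.5):
    """Independently keep each auxiliary modality with probability keep_prob (assumed value)."""
    return {m for m in MODALITIES if rng.random() < keep_prob}

def build_inputs(sample, kept):
    """RGB is always passed through; dropped modalities are replaced by the placeholder."""
    inputs = {"rgb": sample["rgb"]}
    for m in MODALITIES:
        inputs[m] = sample[m] if m in kept else PLACEHOLDER
    return inputs

# Hypothetical training instance with all modalities available.
sample = {"rgb": "image_0", "depth": "depth_0", "camera": "camera_0"}

rng = random.Random(0)
for step in range(4):
    kept = sample_modalities(rng)
    print(step, build_inputs(sample, kept))
```

Because every subset, including the empty one, can be drawn during training, the same model can be run at test time with RGB only or with whichever auxiliary inputs happen to be available.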

Paper Structure

This paper contains 26 sections, 13 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: We propose OmniVGGT, a spatial foundation model that can effectively benefit from an arbitrary number of auxiliary geometric modalities (depth, camera intrinsics, and pose) to obtain high-quality 3D geometric results. Experimental results show that OmniVGGT achieves state-of-the-art performance across various downstream tasks and further improves performance on robot manipulation tasks.
  • Figure 2: Overview of OmniVGGT. OmniVGGT takes as input a set of images together with an arbitrary number of corresponding camera parameters (poses and intrinsics) or depth maps. Camera placeholder tokens and depth placeholder tokens are used to substitute the tokens for which auxiliary information is missing. The inputs are processed through $L$ layers of Alternating-Attention, and finally, three prediction heads are employed to output depth maps, camera poses, and 3D point maps.
  • Figure 3: Visual Results of OmniVGGT with Different Auxiliary Information. (Top) Camera information helps correct challenging scenarios with little or no overlap. (Middle) Providing depth information leads to more accurate local geometry, such as on door surfaces. (Bottom) When both depth and camera information are provided, the relative distances and viewing angles are properly corrected.
  • Figure 4: Visual Comparisons on the 7-Scenes [shotton2013scene], NRGBD [azinovic2022neural], and ETH3D [schops2017multi] datasets. OmniVGGT exhibits accurate spatial relationships and geometric consistency, even in extremely challenging cases. More examples can be found in the appendix.
  • Figure 5: Visualization of our GeoAdapter module in ablation.
  • ...and 6 more figures