OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer

Haosong Peng; Hao Li; Yalun Dai; Yushi Lan; Yihang Luo; Tianyu Qi; Zhengshen Zhang; Yufeng Zhan; Junfei Zhang; Wenchao Xu; Ziwei Liu

TL;DR

OmniVGGT addresses the RGB-only assumption of most 3D foundation models by accepting any subset of auxiliary geometric inputs (depth maps, camera intrinsics, camera poses) during both training and inference. It introduces a GeoAdapter that injects depth and camera parameters hierarchically through zero-initialized convolutions, which keeps the foundation model's representations stable, and pairs it with a stochastic multimodal fusion strategy that learns robust spatial features. The approach achieves state-of-the-art results on monocular and multi-view depth estimation, multi-view stereo, and camera pose estimation, and extends to vision-language-action (VLA) robotics, where depth and pose cues improve spatial reasoning and manipulation. Inference speed stays comparable to VGGT while modality usage remains flexible, and the paper includes extensive ablations and cross-domain evaluations.
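To make the zero-initialization idea concrete, here is a minimal sketch in plain Python. It is not the paper's GeoAdapter: the class `ZeroInitPointwiseConv`, the helper `inject`, the additive fusion, and the toy token shapes are all assumptions chosen for illustration. It only shows the general property that a projection whose weights start at zero leaves the backbone features unchanged until training moves those weights.

```python
from typing import List

Vector = List[float]


class ZeroInitPointwiseConv:
    """Hypothetical 1x1 convolution (a per-token linear map) starting at zero."""

    def __init__(self, in_channels: int, out_channels: int) -> None:
        # All weights and biases are zero, so the initial output is all zeros.
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, tokens: List[Vector]) -> List[Vector]:
        # Apply the same linear map to every token independently.
        return [
            [sum(w * x for w, x in zip(row, token)) + b
             for row, b in zip(self.weight, self.bias)]
            for token in tokens
        ]


def inject(backbone_tokens: List[Vector],
           geo_tokens: List[Vector],
           conv: ZeroInitPointwiseConv) -> List[Vector]:
    """Add projected geometric tokens to backbone tokens (assumed additive fusion)."""
    projected = conv(geo_tokens)
    return [[f + g for f, g in zip(feat, geo)]
            for feat, geo in zip(backbone_tokens, projected)]


# Toy data: two tokens, three backbone channels, two geometry channels.
backbone_tokens = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
geo_tokens = [[3.0, 4.0], [-2.0, 7.0]]

conv = ZeroInitPointwiseConv(in_channels=2, out_channels=3)
fused_at_init = inject(backbone_tokens, geo_tokens, conv)

# Simulate one weight moving away from zero, as a training step might do.
conv.weight[0][0] = 0.1
fused_after_update = inject(backbone_tokens, geo_tokens, conv)
```

At initialization `fused_at_init` equals `backbone_tokens`, so the pretrained representation is untouched; once a weight becomes non-zero, geometric information starts to flow into the fused features.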

Abstract

General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrate OmniVGGT into vision-language-action (VLA) models. The resulting OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
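The per-instance modality sampling described above can be sketched as follows. This is an illustrative assumption rather than the authors' procedure: the abstract does not state the sampling distribution, so the sketch keeps each auxiliary modality independently with a hypothetical probability `keep_prob`, and the names `AUXILIARY_MODALITIES`, `sample_modalities`, and `build_inputs` are invented for the example.

```python
import random
from typing import Dict, FrozenSet, Optional

# Auxiliary inputs named in the abstract; RGB is always provided.
AUXILIARY_MODALITIES = ("depth", "intrinsics", "extrinsics")


def sample_modalities(rng: random.Random, keep_prob: float = 0.5) -> FrozenSet[str]:
    """Pick a random subset of auxiliary modalities for one training instance.

    Assumption: each modality is kept independently with probability keep_prob.
    """
    return frozenset(m for m in AUXILIARY_MODALITIES if rng.random() < keep_prob)


def build_inputs(sample: Dict[str, object],
                 active: FrozenSet[str]) -> Dict[str, Optional[object]]:
    """Keep RGB and mask out (set to None) every auxiliary modality not sampled."""
    inputs: Dict[str, Optional[object]] = {"rgb": sample["rgb"]}
    for name in AUXILIARY_MODALITIES:
        inputs[name] = sample[name] if name in active else None
    return inputs


# Toy instance with placeholder strings standing in for real tensors.
rng = random.Random(0)
sample = {"rgb": "image", "depth": "depth_map",
          "intrinsics": "K", "extrinsics": "pose"}

drawn_subsets = [sample_modalities(rng) for _ in range(200)]
example_inputs = build_inputs(sample, drawn_subsets[0])
```

Because the empty subset (RGB only) and the full subset both occur during training under this scheme, a model trained this way sees every input combination it may face at test time, which matches the stated goal of supporting an arbitrary number of modality inputs.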
