StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision

Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, Liujuan Cao

Abstract

Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) is a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers from significant degradation of geometric details during feature extraction. This characteristic conflicts with the requirements of binocular stereo vision, thereby constraining its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network achieved the $1^{st}$ rank among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.

Paper Structure

This paper contains 27 sections, 21 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Camera focal length serves as a determinative factor in disparity estimation, yet existing stereo vision backbones lack targeted learning of camera geometry, causing them to largely neglect this critical prior. Given that VGGT is explicitly trained on 3D geometric priors such as camera poses, we seek to exploit its inherent capacity for encoding camera pose representations. However, we observe that VGGT tends to excessively degrade the structural contours within the feature maps. This smoothing property is architecturally incompatible with the pixel-accurate alignment demands of stereo vision, creating a bottleneck for downstream stereo applications. StereoVGGT integrates the camera geometry knowledge encoded in VGGT while preserving robust feature representation capabilities, and can serve as a highly effective backbone for stereo vision.
  • Figure 2: Blue histograms visualize the median and mean camera FOV errors obtained when applying different frameworks. This analysis evaluates camera FOV on the ETH3D dataset.
  • Figure 3: VGGT suffers from spatial-structural degradation during its feature processing. (a) Histograms illustrate the SSIM values calculated between the extracted feature maps and the original input images across the ETH3D, KITTI, and Middlebury datasets. (b) Visualizations of feature maps. The red bounding boxes highlight the vehicle contours extracted by each respective method.
  • Figure 4: StereoVGGT architecture. StereoVGGT comprises three main stages. First, EMWM synthesizes a new set of optimized DINO weights by merging the weights of VGGT, DINOv2, and an MDE model, guided by an entropy-based criterion. Second, the patch tokens generated by the re-weighted DINO are concurrently fed into both frozen VGGT Frame Attention (FA) Blocks and an MDE neck. The VGGT FA features subsequently modulate the MDE neck features, thereby achieving a balance between camera-geometry knowledge and fine-grained image-feature representation. Finally, the resulting latent features $X_{stereovggt}$ are passed through a DPT head to generate the disparity prior $d_{stereovggt}$. Both the latent features and the disparity prior can then be leveraged in downstream stereo-vision tasks, including stereo conversion and stereo matching.
  • Figure 5: Visual comparison on the KITTI dataset. The text in parentheses indicates the feature backbone used by each method. Both input scenes contain gaps between the signs and poles. A key difference is that IGEV-Stereo and AiO-Stereo fail to reconstruct these holes, whereas our method successfully recognizes them as part of the distant background.
  • ...and 5 more figures
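Figure 1's claim that focal length is determinative for disparity follows from standard rectified-stereo geometry, where disparity $d = fB/Z$ for focal length $f$ (in pixels), baseline $B$, and depth $Z$. A minimal sketch of this relation, assuming standard pinhole notation rather than anything specific to the paper:

```python
import numpy as np

def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Convert metric depth to disparity for a rectified stereo pair.

    Standard pinhole stereo geometry: d = f * B / Z. Because f enters
    multiplicatively, an error in the assumed focal length scales every
    disparity estimate by the same factor -- which is why a backbone with
    no camera-geometry prior is at a disadvantage.
    """
    depth_m = np.asarray(depth_m, dtype=np.float64)
    return focal_px * baseline_m / depth_m

# Illustrative calibration values (roughly KITTI-like: f ~ 720 px, B ~ 0.54 m).
disp = depth_to_disparity([10.0, 20.0, 40.0], focal_px=720.0, baseline_m=0.54)
```

Halving the depth doubles the disparity, so near structures demand the pixel-accurate contours that the paper argues VGGT's smoothed features fail to preserve.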