Playing to Vision Foundation Model's Strengths in Stereo Matching

Chuang-Wei Liu, Qijun Chen, Rui Fan

TL;DR

This work introduces ViTAS, a vision foundation model (VFM) adapter for stereo matching that preserves cost-volume back-ends while leveraging general-purpose VFM features. ViTAS comprises a spatial differentiation module (SDM), patch attention fusion modules (PAFMs), and cross-attention modules (CAMs), which together build multi-scale feature pyramids and perform stereo-aware fusion; the resulting ViTAStereo achieves state-of-the-art results on KITTI Stereo 2012 and generalizes strongly across datasets. Key findings are that the PAFM offers the largest performance gains, that cost volumes remain crucial for generalizability, and that ViTAS works well with existing state-of-the-art back-ends such as IGEV-Stereo and GMStereo. The approach demonstrates that adapting VFMs via lightweight, modular adapters can significantly improve stereo matching accuracy while maintaining compatibility and interpretability through cost-volume structures, with potential for broader adoption in dense geometric vision tasks.

Abstract

Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those built on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, specifically for dense prediction tasks, they often underperform in geometric vision tasks. This study serves as the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is built from three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network, StereoBase, by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
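
To make the metric in the abstract concrete, the following is a minimal sketch of a "percentage of error pixels" computation at a 3-pixel tolerance, together with a relative-reduction helper. It is not the paper's or the benchmark's evaluation code. The function names, the use of `None` for pixels without ground truth, and the reading of "approximately 7.9%" as a relative reduction are all assumptions made for illustration, and the disparity values are made up.

```python
def error_pixel_rate(pred, gt, tol=3.0):
    """Percentage of valid pixels whose absolute disparity error exceeds tol.

    pred and gt are flattened disparity maps (hypothetical layout);
    gt uses None where no ground truth is available.
    """
    errors = 0
    valid = 0
    for p, g in zip(pred, gt):
        if g is None:  # skip pixels without ground truth
            continue
        valid += 1
        if abs(p - g) > tol:
            errors += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return 100.0 * errors / valid


def relative_reduction(baseline_rate, new_rate):
    """Relative reduction (in percent) of new_rate with respect to baseline_rate."""
    return 100.0 * (baseline_rate - new_rate) / baseline_rate


# Toy, made-up disparities; these are not results from the paper.
gt = [10.0, 12.0, None, 30.0, 8.0]
pred = [10.5, 16.5, 20.0, 29.0, 12.0]
rate = error_pixel_rate(pred, gt)
print(f"error pixels: {rate:.1f}%")
```

Under this reading, a method with a lower error-pixel rate than a baseline would be compared through `relative_reduction(baseline_rate, new_rate)`; the official benchmark may apply additional rules (for example, occlusion masks) that this sketch omits.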

Paper Structure

This paper contains 19 sections, 4 equations, 8 figures, 6 tables, and 1 algorithm.

Figures (8)

  • Figure 1: An illustration of our proposed ViTAS, consisting of an SDM, four CAMs, and three PAFMs for each sub-network.
  • Figure 2: Illustrations of (a) local patch attention versus conventional global attention, (b) quasi-global attention, and (c) multi-scale feature aggregation within PAFM.
  • Figure 3: Ablation studies on (a) the optimal configuration for CAM attention blocks and (b) the most suitable number of unfrozen VFM blocks.
  • Figure 4: Qualitative experimental results on the KITTI Stereo datasets (Geiger et al., 2012; Menze and Geiger, 2015), where significantly improved regions are shown with pink dashed boxes.
  • Figure 5: Qualitative experimental results on the KITTI Eval dataset. Significantly improved regions are shown in pink dashed boxes.
  • ...and 3 more figures