All-in-One: Transferring Vision Foundation Models into Stereo Matching

Jingyi Zhou, Haoyu Zhang, Jiakang Yuan, Peng Ye, Tao Chen, Hao Jiang, Meiya Chen, Yangyang Zhang

TL;DR

The paper addresses the limited generalization of stereo matching encoders by leveraging multiple Vision Foundation Models (VFMs) to enrich feature representations. It introduces AIO-Stereo, a framework that uses dual-level knowledge utilization and dual-level selective knowledge transfer to align and fuse heterogeneous VFM features into a single stereo model. The approach demonstrates state-of-the-art performance on Middlebury and ETH3D, as well as strong zero-shot generalization, highlighting the practical impact of integrating broad visual priors into depth estimation. By effectively orchestrating diverse VFMs, the work shows substantial improvements in texture-rich and low-texture regions, enabling more robust depth predictions in real-world settings.

Abstract

As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work, we propose AIO-Stereo, which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we propose a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on this mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranks $1^{st}$ on the Middlebury dataset, and outperforms all published work on the ETH3D benchmark.

Paper Structure

This paper contains 32 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) Overview of AIO-Stereo, which transfers selected knowledge from multiple VFMs to a single stereo matching model. (b) Comparison between Selective-IGEV and our AIO-Stereo in dark and low-texture areas.
  • Figure 2: The overall framework of AIO-Stereo. Left: AIO-Stereo selectively learns knowledge from SAM, DINO, and Depth Anything through the proposed dual-level selective knowledge transfer module. Right: The detailed structure of the proposed dual-level selective knowledge transfer module.
  • Figure 3: Visual comparison on the Middlebury dataset.
  • Figure 4: Visualization of the selection weights for each VFM. (a) Reference image. (b-d) Selection weights of DINO, SAM, and Depth Anything respectively.
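
The per-VFM selection weights shown in Figure 4 can be pictured with a small sketch. This is not the paper's module, whose internals are not given on this page. It assumes, purely for illustration, that each VFM contributes an already-aligned feature vector per pixel, that a hypothetical score (`logits`) is available per VFM and pixel, and that the scores are normalized with a softmax before a weighted sum. The names `select_and_fuse`, `features`, and `logits` are invented for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_and_fuse(features, logits):
    """Fuse per-pixel features from several sources with softmax weights.

    features[k][p] is the feature vector (list of floats) of source k at pixel p.
    logits[k][p] is a hypothetical selection score of source k at pixel p.
    Returns (fused, weights), where fused[p] is the weighted sum of the source
    features at pixel p and weights[k][p] sums to 1 over k for every pixel.
    """
    num_sources = len(features)
    num_pixels = len(features[0])
    weights = [[0.0] * num_pixels for _ in range(num_sources)]
    fused = []
    for p in range(num_pixels):
        # Assumption: selection weights are a softmax over the per-source scores.
        w = softmax([logits[k][p] for k in range(num_sources)])
        for k in range(num_sources):
            weights[k][p] = w[k]
        dim = len(features[0][p])
        fused.append(
            [sum(w[k] * features[k][p][d] for k in range(num_sources)) for d in range(dim)]
        )
    return fused, weights


if __name__ == "__main__":
    # Two sources, two pixels, two feature channels (toy values).
    feats = [
        [[1.0, 0.0], [2.0, 2.0]],
        [[0.0, 1.0], [4.0, 0.0]],
    ]
    scores = [[0.0, 0.0], [0.0, 0.0]]  # equal scores give equal weights
    fused, weights = select_and_fuse(feats, scores)
    print(fused, weights)
```

Plotting `weights[k]` as an image for each source `k` would give a map comparable in spirit to panels (b-d) of Figure 4: regions where one source dominates show weights near 1 for that source and near 0 for the others.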