FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, Stan Birchfield

TL;DR

FoundationStereo tackles the challenge of zero-shot generalization in stereo depth estimation by training a foundation-style stereo model on a large-scale, high-fidelity synthetic dataset and by incorporating monocular priors through a Side-Tuning Adapter. It introduces Attentive Hybrid Cost Filtering, combining Axial-Planar Convolution and a Disparity Transformer to enable long-range context within a 4D cost volume, followed by iterative GRU-based refinement for robust disparity estimates. A self-curation pipeline removes ambiguous samples from the synthetic data, yielding the FoundationStereo Dataset (FSD) that enhances robustness and cross-domain transfer. The approach achieves strong zero-shot performance across diverse real-world scenarios and ranks top on ETH3D and Middlebury leaderboards when fine-tuned, illustrating its practical impact for cross-domain stereo depth estimation.

Abstract

Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation. Project page: https://nvlabs.github.io/FoundationStereo/
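
The self-curation step mentioned in the abstract can be pictured as a filter on per-sample prediction error. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: the names `bad_pixel_rate` and `curate`, the 2-pixel error tolerance, and the 0.6 rejection threshold are hypothetical choices made for this example, and disparity maps are represented as flat lists of floats for brevity.

```python
from typing import Callable, List, Sequence, Tuple

Disparity = Sequence[float]  # flattened disparity map (hypothetical representation)


def bad_pixel_rate(pred: Disparity, gt: Disparity, tol: float = 2.0) -> float:
    """Fraction of pixels whose absolute disparity error exceeds `tol` pixels."""
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > tol)
    return bad / len(gt)


def curate(
    samples: Sequence[Tuple[object, Disparity]],
    predict: Callable[[object], Disparity],
    max_bad_rate: float = 0.6,  # assumed threshold, not taken from the text above
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (kept, rejected) by the model's own error.

    A sample on which the trained model still fails badly is treated as
    ambiguous and flagged for removal.
    """
    kept: List[int] = []
    rejected: List[int] = []
    for idx, (inputs, gt) in enumerate(samples):
        rate = bad_pixel_rate(predict(inputs), gt)
        if rate > max_bad_rate:
            rejected.append(idx)
        else:
            kept.append(idx)
    return kept, rejected
```

In an iterative setting, one would presumably retrain on the kept samples and repeat the filter. Here `predict` stands in for any trained stereo model that maps an input pair to a disparity map.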

Paper Structure (23 sections, 5 equations, 9 figures, 10 tables)

This paper contains 23 sections, 5 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Zero-shot prediction on in-the-wild images. Our method generalizes to diverse scenarios (indoor / outdoor), objects with challenging properties (textureless / reflective / translucent / thin-structured), complex illumination (shadow / strong exposure), and various viewing perspectives and sensing ranges.
  • Figure 2: Overview of our proposed FoundationStereo. The Side-Tuning Adapter (STA) adapts the rich monocular priors from a frozen DepthAnythingV2 [Yang et al., 2024] and combines them with fine-grained high-frequency features from a multi-level CNN for unary feature extraction. Attentive Hybrid Cost Filtering (AHCF) combines the strengths of Axial-Planar Convolution (APC) filtering and a Disparity Transformer (DT) module to effectively aggregate features along the spatial and disparity dimensions over the 4D hybrid cost volume. An initial disparity is then predicted from the filtered cost volume and subsequently refined through GRU blocks. At each refinement step, the latest disparity is used to look up features from both the filtered hybrid cost volume and the correlation volume to guide the next refinement. The iteratively refined disparity becomes the final output.
  • Figure 3: Left: Design choices for the STA module. Right: Effects of the proposed STA and AHCF modules. "W/o STA" uses only a CNN to extract features. "W/o AHCF" uses a conventional 3D CNN-based hourglass network for cost volume filtering. Results are obtained via zero-shot inference without fine-tuning on the target dataset. STA leverages rich monocular priors to reliably predict the lamp region with inconsistent lighting and the dark guitar sound hole. AHCF effectively aggregates spatial and long-range disparity context to predict accurately over thin repetitive structures.
  • Figure 4: Left: Samples from our FoundationStereo dataset (FSD), which consists of synthetic stereo images with structured indoor / outdoor scenes (top), as well as more randomized scenes with challenging flying objects and higher geometry and texture diversity (bottom). Right: The iterative self-curation process removes ambiguous samples inevitably produced from the domain randomized synthetic data generation process. Example ambiguities include severe texture repetition, ubiquitous reflections with limited surrounding context, and pure color under improper lighting.
  • Figure 5: Qualitative comparison of zero-shot inference on in-the-wild images. For each comparison method, we select the best-performing checkpoint from its public release, which has been trained on a mixture of public datasets. These images exhibit challenging reflections, translucency, repetitive textures, complex illumination, and thin structures, revealing the importance of our network architecture and large-scale training.
  • ...and 4 more figures
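
The Figure 2 caption says an initial disparity is predicted from the filtered cost volume but does not say how. One common way to do this in stereo networks is a softmax-weighted expectation over disparity candidates, sketched below for a single pixel. Treat this as an assumption rather than the paper's stated method: the function name `soft_disparity` is hypothetical, and `scores[d]` is assumed to be a matching score for disparity `d`, where higher means a better match.

```python
import math
from typing import Sequence


def soft_disparity(scores: Sequence[float]) -> float:
    """Expected disparity under a softmax over per-disparity matching scores.

    scores[d] is the (assumed) matching score for integer disparity d.
    The result is a differentiable, sub-pixel disparity estimate.
    """
    if len(scores) == 0:
        raise ValueError("scores must be non-empty")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total
```

A sharply peaked score vector yields a disparity near the peak index, while a flat one yields the mean of the candidate range. That behavior is one reason an iterative refinement stage, such as the GRU blocks in the caption, is useful for ambiguous regions.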