Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

Luca Bartolomei, Fabio Tosi, Matteo Poggi, Stefano Mattoccia

TL;DR

Stereo Anywhere tackles the generalization gaps in stereo matching by fusing traditional stereo geometry with monocular depth priors from Vision Foundation Models in a dual-branch architecture. The method builds two correlation volumes (stereo and monocular priors), augments and truncates them, and iteratively refines disparity through a RAFT-inspired framework, guided by differentiable monocular scaling. The MonoTrap dataset provides a rigorous testbed for optical illusions that challenge monocular predictors, while extensive zero-shot and non-Lambertian experiments demonstrate state-of-the-art generalization and robustness to mirrors, transparency, and textureless regions. The work shows that leveraging robust monocular priors within a principled stereo framework can achieve reliable depth estimation across diverse real-world scenarios without requiring large-scale real stereo data.
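The correlation volume mentioned above can be illustrated on a single scanline. The following is a minimal sketch of a generic dot-product correlation volume of the kind used by RAFT-style stereo networks; it is not the authors' implementation. The function names `correlation_volume` and `argmax_disparity`, the list-based feature layout, and the toy one-hot features are all assumptions made for illustration, and the truncation and augmentation steps are not modeled.

```python
from math import sqrt


def correlation_volume(left_feats, right_feats, max_disp):
    """Build volume[x][d] = <f_L(x), f_R(x - d)> / sqrt(C) for one scanline.

    left_feats / right_feats: list over pixel column x of C-dim feature lists.
    Hypotheses that fall outside the right image get a correlation of 0.0.
    (Hypothetical helper for illustration only.)
    """
    width = len(left_feats)
    norm = sqrt(len(left_feats[0]))
    volume = []
    for x in range(width):
        row = []
        for d in range(max_disp + 1):
            xr = x - d  # matching column in the right image
            if xr < 0:
                row.append(0.0)
            else:
                dot = sum(a * b for a, b in zip(left_feats[x], right_feats[xr]))
                row.append(dot / norm)
        volume.append(row)
    return volume


def argmax_disparity(volume):
    """Winner-takes-all disparity per pixel (first maximum on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in volume]


# Toy scanline: right features are one-hot per column; the left scanline is
# the right one shifted by two pixels, so the true disparity is 2.
WIDTH = 6
right = [[1.0 if c == x else 0.0 for c in range(WIDTH)] for x in range(WIDTH)]
left = [[0.0] * WIDTH, [0.0] * WIDTH] + right[: WIDTH - 2]

volume = correlation_volume(left, right, max_disp=3)
disparity = argmax_disparity(volume)
print(disparity[2:])
```

In a full network the winner-takes-all step is replaced by the iterative refinement described above; the argmax is used here only to show that the volume peaks at the correct disparity.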

Abstract

We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.

Paper Structure

This paper contains 27 sections, 26 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Stereo Anywhere: Combining Monocular and Stereo Strengths for Robust Depth Estimation. Our model achieves accurate results on standard conditions (on Middlebury scharstein2014high), while effectively handling non-Lambertian surfaces where stereo networks fail (on Booster zamaramirez2022booster) and perspective illusions that deceive monocular depth foundation models (on MonoTrap, our novel dataset).
  • Figure 2: Stereo Anywhere Architecture. Given a stereo pair, (1) a pre-trained backbone extracts features, which are used to build a correlation volume. This volume is then truncated (2) to reject matching costs computed for disparity hypotheses that lie behind non-Lambertian surfaces such as glass and mirrors. On a parallel branch, the two images are processed by a monocular VFM to obtain two depth maps (3), which are used to build a second correlation volume from the retrieved normals (4). This volume is then aggregated through a 3D CNN to predict a new disparity map, which is used to align the original monocular depth to metric scale through a differentiable scaling module (5). In parallel, the monocular depth map of the left image is processed by another backbone (6) to extract context features. Finally, the two volumes and the context features from monocular depth guide the iterative disparity prediction (7).
  • Figure 3: Samples from MonoTrap Dataset. We report two scenes featured in our dataset, showing the left image, the ground-truth depth, and the predictions by Depth Anything v2 depth_anything_v2, highlighting how it fails in the presence of visual illusions.
  • Figure 4: Qualitative Results -- Zero-Shot Generalization. Predictions by state-of-the-art models and Stereo Anywhere.
  • Figure 5: Qualitative Results -- Zero-Shot Non-Lambertian Generalization. Predictions by state-of-the-art models and Stereo Anywhere.
  • ...and 15 more figures
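
Step (5) in the Figure 2 caption aligns a monocular depth prediction to the scale of a disparity map. One common way to do this, shown here only as a sketch under stated assumptions, is a weighted least-squares fit of a scale and a shift. The closed-form solution uses only sums and divisions, so it is differentiable with respect to its inputs. Whether the paper's scaling module takes this exact form is not stated in the text above; the function name `fit_scale_shift`, the optional per-pixel weights, and the sample values are hypothetical.

```python
def fit_scale_shift(mono, disp, weights=None):
    """Solve min over (s, t) of sum_i w_i * (s * mono_i + t - disp_i)^2.

    mono: flattened monocular (relative) inverse-depth values.
    disp: flattened reference disparities for the same pixels.
    weights: optional per-pixel confidences (assumed, defaults to 1.0).
    Returns (scale, shift) from the 2x2 normal equations.
    """
    if weights is None:
        weights = [1.0] * len(mono)
    sw = sum(weights)
    sm = sum(w * m for w, m in zip(weights, mono))
    sd = sum(w * d for w, d in zip(weights, disp))
    smm = sum(w * m * m for w, m in zip(weights, mono))
    smd = sum(w * m * d for w, m, d in zip(weights, mono, disp))
    det = sw * smm - sm * sm
    if abs(det) < 1e-12:
        raise ValueError("degenerate fit: monocular values are constant")
    scale = (sw * smd - sm * sd) / det
    shift = (smm * sd - sm * smd) / det
    return scale, shift


# Synthetic check: the reference disparity is an exact affine function of the
# monocular values, so the fit should recover scale 3 and shift 2.
mono_values = [0.1, 0.4, 0.7, 1.0]
ref_disp = [3.0 * m + 2.0 for m in mono_values]
scale, shift = fit_scale_shift(mono_values, ref_disp)
aligned = [scale * m + shift for m in mono_values]
print(round(scale, 6), round(shift, 6))
```

The weights give a natural place to down-weight pixels where the reference disparity is unreliable, although how confidence is handled in the actual method is not described in this excerpt.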