A Survey on Deep Stereo Matching in the Twenties

Fabio Tosi; Luca Bartolomei; Matteo Poggi

TL;DR

The survey maps the rapid evolution of deep stereo matching in the 2020s, organizing architectures into foundational, efficiency-oriented, multi-task, and beyond-RGB categories. It highlights RAFT-Stereo-inspired iterative refinement, Vision Transformer approaches, and neural MRFs as pivotal developments, and details efficiency techniques such as compact cost volumes and cascaded processing. The paper also surveys open challenges, including domain shift, over-smoothing, and non-Lambertian or asymmetric scenes, and organizes the proposed solutions into a taxonomy that covers both offline and online domain adaptation. Its benchmark analysis (KITTI 2015, Middlebury v3, ROB, Booster) documents the progress made so far and the gaps that remain, emphasizing the need for generalization, multimodal sensing, and scalable models. Overall, the work serves as a guide for researchers and practitioners, pointing future work toward robust, efficient, and multimodal stereo systems, with potential for foundation models in this domain.

Abstract

Stereo matching is close to half a century old, yet it has evolved rapidly in the last decade thanks to deep learning. While previous surveys in the late 2010s covered the first stage of this revolution, the last five years of research brought further ground-breaking advancements to the field. This paper aims to fill this gap in two ways: first, we offer an in-depth examination of the latest developments in deep stereo matching, focusing on the pioneering architectural designs and groundbreaking paradigms that have redefined the field in the 2020s; second, we present a thorough analysis of the critical challenges that have emerged alongside these advances, providing a comprehensive taxonomy of these issues and exploring the state-of-the-art techniques proposed to address them. By reviewing both the architectural innovations and the key challenges, we offer a holistic view of deep stereo matching and highlight the specific areas that require further investigation. To accompany this survey, we maintain a regularly updated project page that catalogs papers on deep stereo matching in our Awesome-Deep-Stereo-Matching (https://github.com/fabiotosi92/Awesome-Deep-Stereo-Matching) repository.

Paper Structure (45 sections, 1 equation, 5 figures, 4 tables)

Introduction
Background
Architectures
Foundational Deep Stereo Architectures
CNN-based Cost Volume Aggregation
Neural Architecture Search for Stereo Matching
Iterative Optimization-based Architectures
Vision Transformer-based Architectures
Markov Random Field-based Architectures
Efficiency-Oriented Architectures
Compact Cost Volume Representations
Efficient Cost Volume Processing
Compact Architectures
Multi-task Deep Architectures
Semantic Stereo Matching
...and 30 more sections

Figures (5)

Figure 2: A taxonomy of deep learning-based stereo matching architectures in the 2020s. We categorize the reviewed methods based on their key designs and paradigms.
Figure 3: RAFT-Stereo (Lipson et al., 2021) architecture. It constructs a correlation pyramid from correlation features (blue) extracted from each image. A context encoder extracts "context" image features (white) and an initial hidden state. The disparity field starts at zero. In each iteration, the GRU(s) (green) sample from the correlation pyramid using the current disparity estimate. Sampled correlation features, initial image features, and current hidden state(s) are processed by the GRU(s) to update the hidden state and disparity estimate. Picture from Lipson et al. (2021).
Figure 4: A taxonomy of the main challenges (and solutions) in deep stereo matching. For each, we highlight the key problem areas and novel techniques developed.
Figure 5: Qualitative comparison -- PSMNet variants. From left to right: reference image, disparity maps predicted by networks trained on synthetic data with ground-truth (Graft-PSMNet, ITSA-PSMNet) or on real data without any ground-truth (MfS-PSMNet, NS-PSMNet).
Figure 6: Bleeding Artifacts. The smooth disparities predicted between foreground and background objects project into flying points in 3D space (a,c), whereas precise 3D reconstructions demand sharp discontinuities (b,d).
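
The bleeding effect described in Figure 6 can be illustrated with a small numeric sketch. It relies only on the standard triangulation relation for a rectified stereo pair, Z = f * B / d. The focal length, baseline, and disparity values below are hypothetical numbers chosen for illustration and are not taken from the survey; the helper name `disparity_to_depth` is likewise an assumption.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Triangulate depth in metres from disparity in pixels (rectified pair)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters, for illustration only.
FOCAL_PX = 700.0
BASELINE_M = 0.5

# One scanline crossing a depth edge: foreground (40 px), then background (10 px).
sharp = [40.0, 40.0, 40.0, 10.0, 10.0, 10.0]
# An over-smoothed prediction blends the two surfaces across the edge.
smooth = [40.0, 40.0, 30.0, 20.0, 10.0, 10.0]

sharp_depth = [disparity_to_depth(d, FOCAL_PX, BASELINE_M) for d in sharp]
smooth_depth = [disparity_to_depth(d, FOCAL_PX, BASELINE_M) for d in smooth]

# Depths that lie on neither real surface are the "flying points".
surfaces = set(sharp_depth)
flying = [z for z in smooth_depth if z not in surfaces]

print("sharp depths: ", sharp_depth)
print("smooth depths:", smooth_depth)
print("flying points:", flying)
```

With these assumed values, the sharp scanline yields only two depths (the foreground and the background surface), whereas the smoothed scanline also yields intermediate depths that belong to no physical surface, which is what appears as flying points in the reconstructed point cloud.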
