Stereo Risk: A Continuous Modeling Approach to Stereo Matching

Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, Yao Yao, Luc Van Gool

TL;DR

This work reframes stereo matching as a continuous risk minimization problem, addressing the gap between continuous scene depth and discrete disparity hypotheses. By interpolating a discrete disparity distribution with a Laplacian kernel to form a continuous density $p(x; p^m)$ and minimizing an $L^1$ risk, the method achieves robust performance for multi-modal disparity distributions. A differentiable forward-backward mechanism based on the implicit function theorem enables end-to-end training despite the non-differentiable optimization step, and a two-stage cascade network efficiently handles large disparity ranges. Empirically, the approach delivers state-of-the-art results on the SceneFlow and KITTI benchmarks and demonstrates strong cross-domain generalization to Middlebury and ETH3D, with ablations highlighting the advantages of the $L^1$ risk and the kernel choice. The method offers a principled, scalable pathway to accurate and robust stereo matching with practical implications for autonomous systems and robotic perception.
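The contrast between expectation-based regression and $L^1$ risk minimization can be illustrated with a minimal NumPy sketch. It assumes a Laplacian kernel with a hypothetical bandwidth `sigma` and finds the minimizer by bisection on the risk's derivative; the function names, the bisection solver, and the toy distribution are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def expectation(hyps, probs):
    """Conventional estimate: probability-weighted mean of the hypotheses."""
    hyps = np.asarray(hyps, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(hyps * probs / probs.sum()))

def l1_risk_minimizer(hyps, probs, sigma=1.0, iters=60):
    """Minimize F(y) = integral of |y - x| p(x) dx, where p is a mixture of
    Laplacian kernels centered on the hypotheses (bandwidth sigma is assumed)."""
    hyps = np.asarray(hyps, dtype=float)
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()

    def risk_derivative(y):
        # dF/dy = 2 * CDF(y) - 1, which has a closed form for Laplacian kernels.
        diff = y - hyps
        return np.sum(probs * np.sign(diff) * (1.0 - np.exp(-np.abs(diff) / sigma)))

    # The derivative is non-decreasing in y, non-positive at the smallest
    # hypothesis and non-negative at the largest, so bisection brackets the root.
    lo, hi = hyps.min(), hyps.max()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if risk_derivative(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return float(0.5 * (lo + hi))

# Toy bimodal distribution (hypothetical values): 60% mass at 4, 40% at 16.
hyps = np.arange(21, dtype=float)
probs = np.zeros(21)
probs[4], probs[16] = 0.6, 0.4

print("expectation:", expectation(hyps, probs))
print("L1 minimizer:", l1_risk_minimizer(hyps, probs))
```

In this toy case the weighted mean falls between the two modes, while the $L^1$ minimizer stays close to the dominant mode, which is the behavior the summary attributes to the method for multi-modal distributions.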

Abstract

We introduce Stereo Risk, a new deep-learning approach to the classical stereo-matching problem in computer vision. Stereo matching boils down to per-pixel disparity estimation, and popular state-of-the-art approaches regress scene disparity values from a discretized set of disparity hypotheses. Such discretization often fails to capture the nuanced, continuous nature of scene depth. Stereo Risk departs from the conventional discretization approach by formulating the scene disparity as the optimal solution to a continuous risk minimization problem, hence the name "stereo risk". We demonstrate that $L^1$ minimization of the proposed continuous risk function enhances stereo-matching performance for deep networks, particularly for disparities with multi-modal probability distributions. Furthermore, to enable end-to-end training through the non-differentiable $L^1$ risk optimization, we exploit the implicit function theorem, ensuring a fully differentiable network. A comprehensive analysis demonstrates our method's theoretical soundness and superior performance over state-of-the-art methods across various benchmark datasets, including KITTI 2012, KITTI 2015, ETH3D, SceneFlow, and Middlebury 2014.
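A short derivation may help show how the implicit function theorem yields gradients through the $L^1$ minimizer. It is a sketch under stated assumptions: a Laplacian kernel with bandwidth $\sigma$, and illustrative symbols that do not appear in the text above ($d_i$ for the disparity hypotheses, $p_i$ for their probabilities, $y$ for the predicted disparity). The paper's exact formulation may differ.

$$
p(x) = \sum_i \frac{p_i}{2\sigma} \exp\!\left(-\frac{|x - d_i|}{\sigma}\right), \qquad F(y) = \int |y - x|\, p(x)\, dx
$$

The minimizer $y^*$ satisfies the stationarity condition $\partial F / \partial y = 0$, which for this kernel has the closed form

$$
G(y, p) = \sum_i p_i \,\operatorname{sign}(y - d_i)\left(1 - e^{-|y - d_i|/\sigma}\right) = 0 .
$$

Since $\partial G / \partial y = \frac{1}{\sigma}\sum_j p_j\, e^{-|y - d_j|/\sigma} = 2\,p(y) > 0$, the implicit function theorem gives

$$
\frac{\partial y^*}{\partial p_i} = -\frac{\partial G / \partial p_i}{\partial G / \partial y}
= -\frac{\operatorname{sign}(y^* - d_i)\left(1 - e^{-|y^* - d_i|/\sigma}\right)}{\frac{1}{\sigma}\sum_j p_j\, e^{-|y^* - d_j|/\sigma}} .
$$

Under these assumptions the forward pass can use any root finder for $G(y, p) = 0$, and the backward pass only needs this closed-form ratio evaluated at $y^*$, so the solver itself does not have to be differentiated.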

Paper Structure (26 sections, 11 equations, 6 figures, 16 tables, 1 algorithm)

This paper contains 26 sections, 11 equations, 6 figures, 16 tables, 1 algorithm.

Figures (6)

  • Figure 1: Qualitative Comparison. Comparison with state-of-the-art methods such as IGEV [xu2023iterative] and DLNR [zhao2023high] on the Middlebury dataset. All methods are trained only on SceneFlow [mayer2016large] and evaluated at quarter resolution. Our method generalizes better and predicts high-frequency details more accurately than these methods.
  • Figure 2: Difference between the expectation-based approach and our method. In (a), the pixel in the red circle lies on the boundary of the chair, so its disparity distribution has multiple modes. (b) and (c) show the discrete distribution of disparity hypotheses as orange bars. In (b), the prediction obtained by averaging is blurred and far from any of the modes. In (c), we obtain the optimal solution under the $L^1$ norm, which is more robust and closer to the ground truth. The green curve is the interpolated probability density.
  • Figure 3: Overall pipeline (left to right). We first extract multi-scale features from the left and right images. The subsequent procedure is divided into two stages. In the coarse stage (shown by the orange arrow), we sample disparity hypotheses uniformly and match on 1/4-resolution features. In the refined stage (shown by the green arrow), to match 1/2-resolution features efficiently, disparity hypotheses are sampled around the disparity predicted by the coarse stage. In both stages, we first construct cost volumes by concatenation, then apply stacked hourglass networks to aggregate the matching cost, and finally search for the disparity that minimizes the proposed $L^1$ risk in Eq. (\ref{eq:l1_risk}).
  • Figure 4: Qualitative Comparison. We compare our method with recent state-of-the-art methods such as IGEV [xu2023iterative] and DLNR [zhao2023high] on Middlebury [scharstein2002taxonomy]. All methods are trained only on SceneFlow [mayer2016large] and evaluated at quarter resolution.
  • Figure 5: Qualitative Comparison. We compare our method with recent state-of-the-art methods such as IGEV [xu2023iterative] and PCWNet [shen2022pcw] on ETH3D [schoeps2017cvpr]. All methods are trained only on SceneFlow [mayer2016large].
  • ...and 1 more figures
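
The two-stage hypothesis sampling described in the Figure 3 caption can be sketched as follows. This is a minimal illustration only: the function names, the maximum disparity of 192, the hypothesis counts, and the sampling interval are hypothetical values chosen for the example, not settings reported in the text above.

```python
import numpy as np

def coarse_hypotheses(max_disp, num_hyps):
    """Coarse stage: uniformly spaced disparity hypotheses over [0, max_disp]."""
    return np.linspace(0.0, max_disp, num_hyps)

def refined_hypotheses(coarse_disp, num_hyps, interval, max_disp):
    """Refined stage: per-pixel hypotheses centered on the coarse prediction.

    coarse_disp has shape (H, W); the result has shape (H, W, num_hyps).
    """
    coarse_disp = np.asarray(coarse_disp, dtype=float)
    # Symmetric offsets around zero, e.g. [-2, -1, 0, 1, 2] * interval.
    offsets = (np.arange(num_hyps) - (num_hyps - 1) / 2.0) * interval
    hyps = coarse_disp[..., None] + offsets
    # Keep hypotheses inside the valid disparity range.
    return np.clip(hyps, 0.0, max_disp)

# Hypothetical settings: 49 coarse hypotheses over [0, 192], then 5 refined
# hypotheses per pixel spaced 1 disparity unit apart.
coarse = coarse_hypotheses(192.0, 49)
coarse_prediction = np.array([[10.0, 100.0]])  # stand-in for the coarse-stage output
refined = refined_hypotheses(coarse_prediction, 5, 1.0, 192.0)
print(coarse[:3], refined.shape)
```

Sampling a narrow band around the coarse estimate keeps the refined cost volume small even at higher feature resolution, which is the efficiency argument the caption makes for the cascade.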