LeanStereo: A Leaner Backbone based Stereo Network

Rafia Rahim, Samuel Woerz, Andreas Zell

TL;DR

LeanStereo targets real-time, accurate stereo depth estimation with a lean two-branch backbone, an attention-refined cost volume, and a LogL1 loss that compensates for the backbone's reduced representational capacity. The method runs roughly $9$–$14\times$ faster than leading 3D stereo networks while maintaining competitive accuracy on SceneFlow and KITTI2015. Key contributions are the two-branch backbone design, the attention-based cost volume refinement, and the LogL1 loss, which improves convergence and small-disparity accuracy. The work shows that carefully designed lightweight architectures with task-tailored losses can approach the performance of heavier 3D networks, which makes deployment in real-time systems practical.

Abstract

End-to-end deep stereo matching networks have recently gained popularity, mainly because of their accuracy. This accuracy, however, comes at the cost of increased computational and memory bandwidth requirements and thus requires specialized hardware (GPUs); even then, these methods have long inference times compared to classical methods, which limits their use in real-world applications. What is desired is a highly accurate stereo method with a reasonable inference time. To this end, we propose a fast end-to-end stereo matching method. The majority of the speedup comes from integrating a leaner backbone. To recover the performance lost to the leaner backbone, we propose a cost volume based on learned attention weights, combined with a LogL1 loss for stereo matching. The LogL1 loss not only improves the overall performance of the proposed network but also leads to faster convergence. We carry out a detailed empirical evaluation of different design choices and show that our method requires 4x fewer operations and is about 9 to 14x faster than state-of-the-art methods such as ACVNet [1], LEAStereo [2] and CFNet [3], while giving comparable performance.
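
The abstract names a LogL1 loss but this page does not give its formula. The sketch below is only an illustration under an assumed form, the mean of log(1 + |d_pred - d_gt|), and compares it with plain L1. The function names `l1_loss` and `log_l1_loss` and the `+ 1` offset are assumptions made here, not taken from the paper.

```python
import math


def l1_loss(pred, target):
    """Mean absolute disparity error over the given pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def log_l1_loss(pred, target):
    """ASSUMED LogL1 form: mean of log(1 + |pred - target|).

    The +1 offset keeps the loss at zero for a perfect prediction;
    the paper's exact definition may differ.
    """
    return sum(math.log1p(abs(p - t)) for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    pred = [10.0, 20.0, 35.0]    # hypothetical predicted disparities
    target = [10.5, 20.0, 30.0]  # hypothetical ground-truth disparities
    print("L1    :", l1_loss(pred, target))
    print("LogL1 :", log_l1_loss(pred, target))
```

Under this assumed form, the gradient with respect to the residual has magnitude 1 / (1 + |e|), so large residuals are down-weighted relative to plain L1 while small residuals keep a gradient close to 1.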

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Comparison of LeanStereo with other state of the art methods. Our proposed method is faster and has comparable performance with state of the art 3D methods.
  • Figure 2: Proposed Architecture. Here LF and RF represent the left and right image features extracted via the backbone network ('Concat' means concatenation of LF and RF). HG1 and HG2 represent the first and second hour-glass of the cost aggregation module. Out0, Out1 and Out2 are the outputs from cost aggregation and are used to regress the disparities.
  • Figure 3: Qualitative results on sample SceneFlow images. Here the first and third rows represent the error maps w.r.t. ground truth. Darker red and blue colors represent higher and lower disparity errors, respectively.
  • Figure 4: Qualitative performance (disparity images together with error maps) from the KITTI2015 benchmark. Warmer colors in error maps represent higher disparity errors.
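
The Figure 2 caption says the left and right features (LF and RF) are concatenated before cost aggregation. As a rough illustration, the sketch below builds a concatenation cost volume using the shifting convention common in 3D stereo networks: at disparity d, the left feature at column x is paired with the right feature at column x - d. The function name `concat_cost_volume`, the `[C][H][W]` nested-list layout, and the zero padding for out-of-range columns are assumptions; the caption only states that the features are concatenated. The learned attention weights are not shown.

```python
def concat_cost_volume(left, right, max_disp):
    """Build a concatenation cost volume from two feature maps.

    left, right: features as nested lists shaped [C][H][W].
    Returns a nested list shaped [2C][max_disp][H][W]; positions with
    x < d have no matching right-image column and stay at zero.
    """
    C, H, W = len(left), len(left[0]), len(left[0][0])
    volume = [[[[0.0] * W for _ in range(H)]
               for _ in range(max_disp)]
              for _ in range(2 * C)]
    for d in range(max_disp):
        for c in range(C):
            for y in range(H):
                for x in range(d, W):
                    # First C channels hold the left features.
                    volume[c][d][y][x] = left[c][y][x]
                    # Last C channels hold the right features shifted by d.
                    volume[C + c][d][y][x] = right[c][y][x - d]
    return volume


if __name__ == "__main__":
    lf = [[[1.0, 2.0, 3.0]]]  # hypothetical 1-channel, 1x3 left features
    rf = [[[4.0, 5.0, 6.0]]]  # hypothetical 1-channel, 1x3 right features
    vol = concat_cost_volume(lf, rf, max_disp=2)
    print(vol[0][1][0], vol[1][1][0])
```

In a real network this would be a batched tensor operation on the GPU; nested lists are used here only to keep the example dependency-free.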