DCVSMNet: Double Cost Volume Stereo Matching Network

Mahmoud Tahmasebi, Saif Huq, Kevin Meehan, Marion McAfee

TL;DR

DCVSMNet tackles the trade-off between speed and accuracy in stereo matching by introducing two small cost volumes processed in parallel, each encoding complementary geometric information. A coupling module fuses the geometry from both branches, enabling single-stage disparity estimation that rivals multi-stage refinement while keeping inference fast (~67 ms). The approach generalizes well to real-world datasets (KITTI, ETH3D, Middlebury) despite training primarily on SceneFlow, and it outperforms several fast methods as well as some higher-accuracy models on benchmark tasks. This work highlights how structured fusion of diverse cost-volume representations can improve depth estimation in practical, time-constrained scenarios, with potential for further speedups via lighter backbones and cost-volume pruning.

Abstract

We introduce the Double Cost Volume Stereo Matching Network (DCVSMNet), a novel architecture characterised by two small cost volumes: an upper (group-wise) volume and a lower (norm correlation) volume. Each cost volume is processed separately, and a coupling module is proposed to fuse the geometry information extracted from the upper and lower cost volumes. DCVSMNet is a fast stereo matching network with a 67 ms inference time and strong generalization ability, and it produces results competitive with state-of-the-art methods. Results on several benchmark datasets show that DCVSMNet achieves better accuracy than methods such as CGI-Stereo and BGNet, at the cost of greater inference time.
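To make the two cost volumes concrete, the following is a minimal NumPy sketch of a group-wise correlation volume and a normalized (norm) correlation volume built from left and right feature maps. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `(C, H, W)` feature layout, the zero padding for out-of-view pixels, and the `eps` stabilizer are all hypothetical choices made for this example.

```python
import numpy as np

def groupwise_correlation_volume(left, right, max_disp, num_groups):
    """Build a (G, D, H, W) volume from (C, H, W) feature maps.

    Hypothetical helper: channels are split into G groups and the
    per-group mean of the elementwise product is stored per disparity.
    """
    C, H, W = left.shape
    assert C % num_groups == 0 and max_disp <= W
    volume = np.zeros((num_groups, max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d.
        prod = left[:, :, d:] * right[:, :, : W - d]
        prod = prod.reshape(num_groups, C // num_groups, H, W - d)
        volume[:, d, :, d:] = prod.mean(axis=1)
    return volume

def norm_correlation_volume(left, right, max_disp, eps=1e-5):
    """Build a (1, D, H, W) volume of cosine-like similarities.

    Hypothetical helper: features are L2-normalized over channels
    before the dot product, so values lie roughly in [-1, 1].
    """
    C, H, W = left.shape
    assert max_disp <= W
    left_n = left / (np.linalg.norm(left, axis=0, keepdims=True) + eps)
    right_n = right / (np.linalg.norm(right, axis=0, keepdims=True) + eps)
    volume = np.zeros((1, max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        corr = (left_n[:, :, d:] * right_n[:, :, : W - d]).sum(axis=0)
        volume[0, d, :, d:] = corr
    return volume
```

With a right feature map that is the left one shifted by two pixels, the norm correlation volume peaks near 1 at disparity 2, which is the matching signal the later aggregation stages would refine.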

Paper Structure (19 sections, 8 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison of DCVSMNet with state-of-the-art methods on SceneFlow dataset.
  • Figure 2: DCVSMNet uses two cost volumes to store rich matching cost information. Each volume is processed using a 3D hourglass network. The geometry information extracted from the upper and lower cost volumes is fused by a coupling module, and the final disparity map is generated by regressing the summation of the upper and lower branch outputs.
  • Figure 3: Baseline and single cost volume architecture.
  • Figure 4: Qualitative results on KITTI 2012. Note how the model is able to recover fine details.
  • Figure 5: Qualitative results on KITTI 2015. Note how the model is able to recover fine details.
  • ...and 2 more figures
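
The Figure 2 caption says the final disparity map is regressed from the sum of the upper and lower branch outputs. A minimal sketch of that last step follows, assuming each branch yields a `(D, H, W)` score volume and that the regression is the soft-argmin commonly used in stereo networks (a softmax over the disparity axis followed by an expected value). The function name and the choice of soft-argmin are assumptions for illustration; the text above does not specify the exact operator.

```python
import numpy as np

def regress_disparity(upper_out, lower_out):
    """Sum two (D, H, W) branch outputs and regress an (H, W) disparity map.

    Hypothetical helper: uses soft-argmin style regression, which is an
    assumption here rather than a detail confirmed by the source.
    """
    scores = upper_out + lower_out
    # Subtract the per-pixel maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=0, keepdims=True)
    prob = np.exp(scores)
    prob /= prob.sum(axis=0, keepdims=True)
    # Expected disparity under the per-pixel probability distribution.
    disparities = np.arange(scores.shape[0], dtype=prob.dtype).reshape(-1, 1, 1)
    return (prob * disparities).sum(axis=0)
```

Because the output is an expectation over disparity candidates, it is sub-pixel and differentiable, which is the usual reason this form of regression is preferred over a hard argmax.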