A Flexible Recursive Network for Video Stereo Matching Based on Residual Estimation

Youchen Zhao, Guorong Luo, Hua Zhong, Haixiong Li

TL;DR

This work tackles real-time stereo matching in video by introducing RecSM, a recursive, residual-estimation framework that leverages temporal context to compute current-frame disparities from the previous frame. The core idea is a stackable construction where each SCS combines a Multi-scale Residual Estimation Module with a shared-weights Disparity Optimization Module, progressively refining disparities while keeping computation low. Empirical results on KITTI show that RecSM with three SCSs delivers about a 4x speedup over a strong baseline (ACVNet) with only a small (~0.7%) drop in accuracy, and ablations validate the effectiveness of the MREM, DOM, and the dynamic, stackable design. The approach promises practical impact for fast, autonomous-driving–oriented stereo vision, offering a tunable speed-accuracy tradeoff and a flexible deployment path across scenarios.

Abstract

Because disparity is highly similar between consecutive frames of a video sequence, we define the region where disparity changes as the residual map, which can be computed directly. Based on this, we propose RecSM, a residual-estimation network with a flexible recursive structure for video stereo matching. RecSM accelerates stereo matching with a Multi-scale Residual Estimation Module (MREM), which uses the temporal context as a reference and rapidly obtains the disparity of the current frame by computing only the residual between the current and previous frames. To further reduce the error of the estimated disparities, we use a Disparity Optimization Module (DOM) and a Temporal Attention Module (TAM) to enforce constraints between modules. Together with MREM, they form a flexible Stackable Computation Structure (SCS), so the number of stacked SCSs can be chosen to suit the practical scenario. Experimental results demonstrate that with a stack count of 3, RecSM is 4x faster than ACVNet, running at 0.054 seconds on a single NVIDIA RTX 2080 Ti GPU, with an accuracy decrease of only 0.7%. Code is available at https://github.com/Y0uchenZ/RecSM.
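
The residual idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `recursive_refine`, `toy_residual`, and `max_abs_error`, the toy disparity maps, and the rule that each stage recovers half of the remaining error are all assumptions made for illustration. A learned module such as MREM would take the place of `toy_residual`. The sketch only shows the general pattern of starting from the previous frame's disparity and adding an estimated residual once per stacked stage.

```python
# Hypothetical sketch of recursive residual refinement (plain Python, no dependencies).
# None of these names come from the RecSM code base.

def recursive_refine(prev_disp, estimate_residual, num_stacks=3):
    """Start from the previous frame's disparity and add one residual per stage."""
    disp = [row[:] for row in prev_disp]  # copy so the input map is not modified
    for _ in range(num_stacks):
        residual = estimate_residual(disp)
        disp = [
            [d + r for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(disp, residual)
        ]
    return disp


def max_abs_error(disp, reference):
    """Largest absolute per-pixel difference between two disparity maps."""
    return max(
        abs(d - t)
        for d_row, t_row in zip(disp, reference)
        for d, t in zip(d_row, t_row)
    )


# Toy data: the scene is static except for a small region whose disparity changed.
prev_disp = [[10.0] * 6 for _ in range(4)]
true_disp = [row[:] for row in prev_disp]
for y in range(1, 3):
    for x in range(2, 5):
        true_disp[y][x] = 14.0


def toy_residual(disp):
    """Stand-in estimator (assumption): recovers half of the remaining error."""
    return [
        [0.5 * (t - d) for d, t in zip(d_row, t_row)]
        for d_row, t_row in zip(disp, true_disp)
    ]


# Error after 0, 1, 2, and 3 stacked stages.
errors = [
    max_abs_error(recursive_refine(prev_disp, toy_residual, n), true_disp)
    for n in range(4)
]
print(errors)
```

In this toy setting the residual is zero wherever the disparity did not change, so each stage only corrects the changed region, and adding stages trades extra computation for a smaller remaining error. That mirrors the tunable stack count described above, under the stated assumptions.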

Paper Structure (19 sections, 3 equations, 13 figures, 7 tables)

This paper contains 19 sections, 3 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: RecSM network architecture.
  • Figure 2: Visualization of disparity changes in continuous frames of road scenes. Warmer colors indicate larger changes, while cooler colors represent smaller variations.
  • Figure 3: Visualization of the disparity change distribution in a road scene.
  • Figure 4: Single-scale residual estimation module (small-scale).
  • Figure 5: (a) MREM (b) Temporal attention fusion structure in the large-scale branch.
  • ...and 8 more figures