Temporally Consistent Stereo Matching

Jiaxi Zeng, Chengtang Yao, Yuwei Wu, Yunde Jia

TL;DR

This work tackles temporal inconsistency in video stereo matching by introducing TC-Stereo, which combines three components: temporal disparity completion, which provides a robust initialization from semi-dense priors; temporal state fusion, which produces coherent hidden states; and a dual-space refinement that iterates in both disparity and disparity-gradient spaces. The method leverages a cost-volume-based semi-dense map, a lightweight fusion module, and gradient-guided propagation to extend local surface constraints globally, improving performance in ill-posed regions. Extensive experiments across synthetic and real datasets show state-of-the-art temporal consistency and competitive accuracy with high efficiency, including online inference at frame rates suitable for practical applications. The approach is robust in occluded and reflective regions but is limited in highly dynamic scenes and under pose errors; it nonetheless shows clear advantages for online, temporally coherent depth estimation in stereo video pipelines.

Abstract

Stereo matching provides depth estimation from binocular images for downstream applications. These applications mostly take video streams as input and require temporally consistent depth maps. However, existing methods mainly focus on the estimation at the single-frame level. This commonly leads to temporally inconsistent results, especially in ill-posed regions. In this paper, we aim to leverage temporal information to improve the temporal consistency, accuracy, and efficiency of stereo matching. To achieve this, we formulate video stereo matching as a process of temporal disparity completion followed by continuous iterative refinements. Specifically, we first project the disparity of the previous timestamp to the current viewpoint, obtaining a semi-dense disparity map. Then, we complete this map through a disparity completion module to obtain a well-initialized disparity map. The state features from the current completion module and from the past refinement are fused together, providing a temporally coherent state for subsequent refinement. Based on this coherent state, we introduce a dual-space refinement module to iteratively refine the initialized result in both disparity and disparity gradient spaces, improving estimations in ill-posed regions. Extensive experiments demonstrate that our method effectively alleviates temporal inconsistency while enhancing both accuracy and efficiency.
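
The projection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, a known stereo `baseline`, a static scene, and a known 4x4 relative pose `T_prev_to_cur`. The function name `project_disparity` and the nearest-point (largest-disparity) collision rule are illustrative choices.

```python
import numpy as np

def project_disparity(disp_prev, K, baseline, T_prev_to_cur):
    """Forward-warp a disparity map into the current view.

    Hypothetical helper: assumes a static scene and a known relative pose.
    Pixels that receive no projected point are left as NaN (semi-dense output).
    """
    h, w = disp_prev.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Pixel grid of the previous frame, restricted to valid disparities.
    v, u = np.mgrid[0:h, 0:w]
    valid = np.isfinite(disp_prev) & (disp_prev > 0)
    u = u[valid].astype(float)
    v = v[valid].astype(float)
    d = disp_prev[valid]

    # Back-project to 3D using depth z = fx * baseline / disparity.
    z = fx * baseline / d
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z, np.ones_like(z)])

    # Express the points in the current camera frame.
    pts_cur = T_prev_to_cur @ pts
    x, y, z_cur = pts_cur[0], pts_cur[1], pts_cur[2]

    # Keep points in front of the camera and re-project them.
    front = z_cur > 1e-6
    x, y, z_cur = x[front], y[front], z_cur[front]
    u_cur = np.rint(fx * x / z_cur + cx).astype(int)
    v_cur = np.rint(fy * y / z_cur + cy).astype(int)
    d_cur = fx * baseline / z_cur

    inside = (u_cur >= 0) & (u_cur < w) & (v_cur >= 0) & (v_cur < h)
    u_cur, v_cur, d_cur = u_cur[inside], v_cur[inside], d_cur[inside]

    # Resolve collisions by keeping the largest disparity (nearest surface).
    out = np.full((h, w), -np.inf)
    np.maximum.at(out, (v_cur, u_cur), d_cur)
    out[np.isinf(out)] = np.nan
    return out
```

With an identity pose the map is returned unchanged; when the camera moves forward, projected pixels spread apart and leave holes. Those holes are the reason the projected map is only semi-dense and is passed to a completion step before refinement.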

Paper Structure (14 sections, 10 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: (a) Visualization of disparity sequences from RAFT-Stereo [lipson2021raft] and our TC-Stereo. The failure cases of RAFT-Stereo lie in the reflective areas on the ground. A small motion can cause severe jitter in disparity predictions, while our method achieves temporally stable outputs. (b) The jitter $|\Delta d|$ between aligned successive disparity maps in all and occluded areas on the TartanAir dataset [wang2020tartanair]. Our method achieves better temporal consistency than RAFT-Stereo. (c) The update step size across iterations of RAFT-Stereo and our TC-Stereo on TartanAir. Compared to RAFT-Stereo, our method performs a disparity search within a local range.
  • Figure 2: Pipeline of TC-Stereo. We first use an encoder to extract left and right features for the current stereo frame. These features are then used to construct a cost volume. A semi-dense disparity map, derived from the cost volume (for the first frame) or projected from the previous timestamp (for subsequent frames), is fed into the Temporal Disparity Completion (TDC) module to obtain an initial dense disparity map. The output state of the TDC module is fused with the state from the past to provide an initial hidden state for refinement. The dual-space refinement module iteratively retrieves the cost volume and alternately refines the disparity map in the disparity and disparity gradient spaces. The final disparity map and hidden state are projected into the viewpoint of the next frame, serving as the temporal information for continuous disparity estimation.
  • Figure 3: Architecture of the dual-space refinement module. The disparity map corresponds to a scene consisting of a wall, a floor, and a transparent glass door. The encircled L denotes the lookup operation on the cost volume, P represents the gradient-guided disparity propagation, and $\mathbf{\Sigma}$ denotes weighted summation.
  • Figure 4: Visualizations on KITTI 2015. (a) Comparison of temporal disparity sequences from RAFT-Stereo [lipson2021raft], IGEV [xu2023iterative], and our method. (b) Comparison of disparities in ill-posed regions between RAFT-Stereo [lipson2021raft] and our method.
  • Figure 5: Visualizations of the disparity map and the update step size at each iteration.
  • ...and 1 more figures
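
The jitter $|\Delta d|$ in Figure 1(b) is measured between aligned successive disparity maps. As a rough sketch of such a measure, and not necessarily the paper's exact metric, the mean absolute difference can be taken over pixels that are valid in both maps. The function name `temporal_jitter` and the optional region `mask` (for example, an occlusion mask) are hypothetical.

```python
import numpy as np

def temporal_jitter(disp_prev_aligned, disp_cur, mask=None):
    """Mean |delta d| between an aligned previous disparity map and the current one.

    Hypothetical helper: `disp_prev_aligned` is assumed to be already warped
    into the current view, with NaN where no correspondence exists.
    """
    valid = np.isfinite(disp_prev_aligned) & np.isfinite(disp_cur)
    if mask is not None:
        # Restrict the evaluation to a region of interest without
        # modifying the caller's mask.
        valid = valid & mask
    if not valid.any():
        return float("nan")
    return float(np.mean(np.abs(disp_cur[valid] - disp_prev_aligned[valid])))
```

Lower values indicate more temporally stable predictions. Passing an occlusion mask restricts the measure to occluded areas, mirroring the "all" versus "occluded" split in the figure.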