Match Stereo Videos via Bidirectional Alignment

Junpeng Jing, Ye Mao, Anlan Qiu, Krystian Mikolajczyk

TL;DR

This work introduces a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods, and proposes a bidirectional alignment mechanism for adjacent frames as a fundamental operation.

Abstract

Video stereo matching is the task of estimating consistent disparity maps from rectified stereo videos. There is considerable scope for improvement in both datasets and methods within this area. Recent learning-based methods often focus on optimizing performance for independent stereo pairs, leading to temporal inconsistencies in videos. Existing video methods typically employ a sliding-window operation over the time dimension, which can result in low-frequency oscillations corresponding to the window size. To address these challenges, we propose a bidirectional alignment mechanism for adjacent frames as a fundamental operation. Building on this, we introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods. Regarding datasets, the synthetic datasets commonly used for training and benchmarking are object-based or indoor, and outdoor nature scenarios are lacking. To bridge this gap, we present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation. Extensive experiments on in-domain, out-of-domain, and robustness evaluation demonstrate the contribution of our methods and datasets, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks. The project page, demos, code, and datasets are available at: https://tomtomtommi.github.io/BiDAVideo/.

Paper Structure

This paper contains 24 sections, 20 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Overview of the proposed methods. The illustration compares existing methods (left) with the proposed approaches (right). (a) Traditional image-based methods estimate disparity independently for each frame, leading to temporal inconsistency. (b) In contrast, our plugin stabilizer network takes inconsistent disparities as input to enhance temporal consistency without retraining the original stereo model. (c) Existing video methods segment sequences into fixed intervals, limiting information propagation to a short time frame range. (d) Our video method constructs cost volumes across neighboring frames and incorporates a recurrent update mechanism based on bidirectional alignment, ensuring global consistency throughout the entire sequence.
  • Figure 2: Infinigen Stereo Video (Infinigen SV) Dataset. Example video frames and depth maps from the proposed dataset. This dataset features a diverse range of outdoor nature scenes, offering a large-scale training set and a benchmark for disparity estimation in near-realistic environments.
  • Figure 3: South Kensington Stereo Video (SouthKen SV) Dataset. The proposed dataset contains diverse indoor and outdoor scenes from the South Kensington area in London, UK, captured using a stereo camera, providing a qualitative benchmark for real-world disparity estimation.
  • Figure 4: Objects presented in the SouthKen SV dataset.
  • Figure 5: Upper Left: The overall pipeline of the proposed BiDAStereo. Given a pair of stereo sequences, bidirectional optical flows are estimated, and feature maps are extracted at three scales. At each scale, the predicted disparities are refined iteratively in the update module, and the final output of the previous stage is used as initialization for the next stage. The same update module is reused at each stage. Upper Right: The architecture of the update module. For each iteration, the Triple-Frame Correlation Layer (TFCL) computes cost volumes from triple-frame feature maps. The Motion-Propagation Recurrent Unit (MRU) is used for global cost aggregation and disparity estimation. Lower: The architectures of TFCL and MRU. For TFCL, bidirectional alignment is conducted from the adjacent right frames to the center frame, and cost volumes are built between the left frame and the aligned right frame. For MRU, convolutional encoders are used for correlations, disparities, and motion features. A motion hidden state feature is introduced for each frame as the auxiliary context in global propagation. In each iteration, adjacent motion hidden state features are aligned towards the center one, updating the center feature and propagating wider temporal information.
  • ...and 5 more figures
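
The Figure 5 caption describes the Triple-Frame Correlation Layer as aligning the adjacent right frames to the center frame and then building cost volumes between the left frame and each aligned right frame. The following is a minimal NumPy sketch of that idea, not the authors' implementation. All function names are hypothetical, the optical flow is assumed to be supplied by the caller, and nearest-neighbour warping, dot-product correlation, and channel-wise concatenation of the three volumes are simplifying assumptions made here for brevity.

```python
import numpy as np


def warp_to_center(feat, flow):
    """Backward-warp an adjacent-frame feature map onto the center frame.

    feat: (H, W, C) features of the adjacent frame.
    flow: (H, W, 2) flow from the center frame to the adjacent frame,
          stored as (dx, dy) in pixels.
    Assumption: nearest-neighbour sampling with border clamping.
    """
    h, w, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return feat[src_y, src_x]


def cost_volume(left, right, max_disp):
    """Correlate left pixel x with right pixel x - d for d in [0, max_disp).

    Returns an (H, W, max_disp) volume; entries with x < d stay at zero.
    """
    h, w, c = left.shape
    vol = np.zeros((h, w, max_disp), dtype=np.float64)
    for d in range(max_disp):
        vol[:, d:, d] = (left[:, d:] * right[:, : w - d]).sum(axis=-1) / np.sqrt(c)
    return vol


def triple_frame_cost(left_c, right_prev, right_c, right_next,
                      flow_to_prev, flow_to_next, max_disp):
    """Stack cost volumes of the center left frame against three right frames.

    The previous and next right frames are first aligned to the center frame
    (one alignment in each temporal direction); the center right frame is
    used as-is. Output shape: (H, W, 3 * max_disp).
    """
    aligned = [
        warp_to_center(right_prev, flow_to_prev),
        right_c,
        warp_to_center(right_next, flow_to_next),
    ]
    return np.concatenate([cost_volume(left_c, r, max_disp) for r in aligned], axis=-1)
```

In this sketch the flow fields point from the center frame to each neighbour, so the warp pulls neighbour features back into the center frame's coordinates before correlation. A learned pipeline would typically use bilinear sampling and an estimated flow instead of the rounded, caller-supplied flow used here.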