Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

Junpeng Jing, Ye Mao, Krystian Mikolajczyk

TL;DR

BiDAStereo is a framework for consistent dynamic stereo matching that models the task as local matching and global aggregation. Locally, it computes correlation in a triple-frame manner to pool information from adjacent frames and improve temporal consistency.

Abstract

Dynamic stereo matching is the task of estimating consistent disparities from stereo videos with dynamic objects. Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies. Existing video methods apply per-frame matching and window-based cost aggregation across the time dimension, leading to low-frequency oscillations at the scale of the window size. To address this challenge, we develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. We further propose a novel framework, BiDAStereo, that achieves consistent dynamic stereo matching. Unlike existing methods, we model this task as local matching and global aggregation. Locally, we consider correlation in a triple-frame manner to pool information from adjacent frames and improve temporal consistency. Globally, to exploit the entire sequence's consistency and extract dynamic scene cues for aggregation, we develop a motion-propagation recurrent unit. Extensive experiments demonstrate the performance of our method, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks.
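
The local matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it works on a single scanline instead of 2D feature maps, uses hand-made one-hot features instead of learned ones, and assumes the flows from the center right frame to its neighbours are already given. The names `warp_row`, `correlation`, and `triple_frame_correlation` are hypothetical, and how the three cost volumes are fused afterwards is left open because the abstract does not specify it.

```python
from typing import List

Vec = List[float]
Row = List[Vec]                   # one scanline: a feature vector per pixel
CostVolume = List[List[float]]    # cost[x][d]


def warp_row(feat: Row, flow: List[float]) -> Row:
    """Backward warp: out[x] samples feat at x + flow[x] (linear interpolation, clamped)."""
    w = len(feat)
    out: Row = []
    for x in range(w):
        src = min(max(x + flow[x], 0.0), w - 1.0)
        x0 = int(src)
        x1 = min(x0 + 1, w - 1)
        a = src - x0
        out.append([(1.0 - a) * f0 + a * f1 for f0, f1 in zip(feat[x0], feat[x1])])
    return out


def correlation(left: Row, right: Row, max_disp: int) -> CostVolume:
    """cost[x][d] = <left[x], right[x - d]>; zero where x - d leaves the row."""
    cost: CostVolume = []
    for x in range(len(left)):
        scores = []
        for d in range(max_disp + 1):
            xr = x - d
            if xr < 0:
                scores.append(0.0)
            else:
                scores.append(sum(p * q for p, q in zip(left[x], right[xr])))
        cost.append(scores)
    return cost


def triple_frame_correlation(left_t: Row, right_prev: Row, right_t: Row,
                             right_next: Row, flow_to_prev: List[float],
                             flow_to_next: List[float],
                             max_disp: int) -> List[CostVolume]:
    """Align the adjacent right frames to time t, then correlate each with left_t."""
    aligned_prev = warp_row(right_prev, flow_to_prev)   # t-1 aligned to t
    aligned_next = warp_row(right_next, flow_to_next)   # t+1 aligned to t
    # Assumption: return the three volumes separately; fusion is not modelled here.
    return [correlation(left_t, r, max_disp)
            for r in (aligned_prev, right_t, aligned_next)]


# Toy scene: true disparity is 1 px, and the content moves 1 px to the right per frame.
W = 6
zero = [0.0] * W


def one_hot(i: int) -> Vec:
    return [1.0 if j == i else 0.0 for j in range(W)]


right_t = [one_hot(x) for x in range(W)]
left_t = [zero] + [one_hot(x - 1) for x in range(1, W)]
right_prev = [one_hot(x + 1) for x in range(W - 1)] + [zero]
right_next = [zero] + [one_hot(x - 1) for x in range(1, W)]

vols = triple_frame_correlation(left_t, right_prev, right_t, right_next,
                                [-1.0] * W, [1.0] * W, max_disp=2)
# Best disparity candidate at two interior pixels, for each of the three volumes.
best = [[v[x].index(max(v[x])) for x in (3, 4)] for v in vols]
```

In this toy scene all three volumes peak at the same disparity for the interior pixels, which is the point of the alignment: after warping, the neighbouring frames provide evidence for the center frame's disparity rather than for their own.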

Paper Structure (16 sections, 8 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Dynamic stereo video. First row: depth maps of the same region in three different frames. Second row: depth maps converted to globally aligned point clouds and rendered with a camera displaced by 15 degrees. Our method gives consistent and accurate disparities without flickering.
  • Figure 2: Illustration of the difference between existing methods (left) and our proposed method (right). Existing methods separate video sequences into fixed segments for processing, adopt a per-frame matching operation to build the cost volumes and apply a sliding window for aggregation, thus limiting the information propagation to a fixed time length. Our method adopts bidirectional alignment for local matching, where the cost volumes are built within the neighboring frames. A self-update mechanism is proposed to update the current state via bidirectional alignment and propagate global consistency across the whole sequence. Details of the self-update can be seen in the section "Motion-propagation based Recurrent Unit".
  • Figure 3: Left: The overall pipeline of the proposed method. Given a pair of stereo sequences, bidirectional optical flows are estimated and feature maps are extracted at three scales. At each scale, the predicted disparities are refined iteratively in the update module, and the final output of the previous stage is fed to the next one as an initialization. The same update module is reused in each stage. Right: The architecture of the update module. In each iteration, the Triple-Frame Correlation Layer (TFCL) computes cost volumes from triple-frame feature maps, and the Motion-propagation Recurrent Unit (MRU) performs global cost aggregation and disparity estimation.
  • Figure 4: The architecture of TFCL and MRU. For TFCL, bidirectional alignment is conducted from the adjacent right frames to the center frame. Cost volumes are built between the left frame and the aligned right frames. For MRU, convolutional encoders are adopted for correlations, disparities, and motion features. A motion hidden state feature is introduced for each frame as an auxiliary context in global propagation. In each iteration, adjacent motion hidden state features are aligned towards the center one, updating the center one and propagating wider temporal information.
  • Figure 5: Qualitative comparisons on the Sintel final dataset.
  • ...and 3 more figures
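
The coarse-to-fine control flow described in the Figure 3 caption (three scales, one shared update module, each stage initialized from the previous stage's output) can be sketched as follows. This is only an illustration of the loop structure under stated assumptions: disparities live on a single scanline, consecutive scales differ by a factor of 2, and `toy_update` is a hypothetical stand-in for the learned update module that simply moves the estimate halfway toward a known target.

```python
from typing import Callable, List, Sequence

Disparity = List[float]
UpdateFn = Callable[[Disparity, int], Disparity]   # (current estimate, stage) -> refined


def upsample2x(disp: Disparity) -> Disparity:
    """Nearest-neighbour 2x upsampling; pixel disparities double with the resolution."""
    return [2.0 * d for d in disp for _ in range(2)]


def coarse_to_fine(widths: Sequence[int], update: UpdateFn, iters: int) -> Disparity:
    """Refine iteratively at each scale, reusing the same update function throughout."""
    disp = [0.0] * widths[0]                # assumption: zero initialization
    for stage, width in enumerate(widths):
        if stage > 0:
            disp = upsample2x(disp)         # previous output initializes this stage
        if len(disp) != width:
            raise ValueError("consecutive scales must differ by a factor of 2")
        for _ in range(iters):
            disp = update(disp, stage)
    return disp


# Hypothetical stand-in for the update module: the true disparity is 8 px at the
# finest of three scales, hence 2 px and 4 px at the two coarser ones.
TARGETS = [2.0, 4.0, 8.0]


def toy_update(disp: Disparity, stage: int) -> Disparity:
    return [d + 0.5 * (TARGETS[stage] - d) for d in disp]


result = coarse_to_fine(widths=(4, 8, 16), update=toy_update, iters=3)
```

Because each stage starts from the upsampled result of the coarser one, the finest stage begins close to its target and needs only small corrections, which is the usual motivation for this kind of initialization.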