FlowDepth: Decoupling Optical Flow for Self-Supervised Monocular Depth Estimation

Yiyang Sun, Zhiyuan Xu, Xiaonian Wang, Jing Yao

TL;DR

This work proposes FlowDepth, in which a Dynamic Motion Flow Module (DMFM) decouples the optical flow with a mechanism-based approach and warps the dynamic regions, thereby solving the mismatch problem caused by moving objects. The authors report that the method outperforms state-of-the-art methods.

Abstract

Self-supervised multi-frame methods have achieved promising results in depth estimation. However, these methods often suffer from mismatch problems caused by moving objects, which break the static-scene assumption. In addition, photometric errors can be computed unfairly in high-frequency or low-texture regions of the images. Existing approaches address these issues with additional black-box networks that supply semantic priors to separate moving objects, and they improve the model only at the loss level. We instead propose FlowDepth, in which a Dynamic Motion Flow Module (DMFM) decouples the optical flow with a mechanism-based approach and warps the dynamic regions, thereby solving the mismatch problem. To handle the unfair photometric errors caused by high-frequency and low-texture regions, we use Depth-Cue-Aware Blur (DCABlur) at the input level and a cost-volume sparsity loss at the loss level, respectively. Experimental results on the KITTI and Cityscapes datasets show that our method outperforms state-of-the-art methods.
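
To make the idea of decoupling optical flow concrete, here is a minimal NumPy sketch. It assumes, rather than takes from the paper's equations, that decoupling amounts to computing the rigid flow induced by depth and camera motion and subtracting it from the total flow. The function names `static_flow` and `dynamic_flow`, the pinhole intrinsics `K`, and the 4x4 pose `T` are hypothetical and chosen only for illustration.

```python
import numpy as np

def static_flow(depth, K, T):
    """Rigid flow caused by camera motion alone (illustrative sketch).

    depth: (H, W) depth map of the source frame.
    K:     (3, 3) pinhole intrinsics (assumed camera model).
    T:     (4, 4) transform from the source camera frame to the target one.
    Returns an (H, W, 2) array of pixel displacements (du, dv).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # back-project: K^-1 p
    pts = rays * depth[..., None]                      # 3D points, source frame
    pts_t = pts @ T[:3, :3].T + T[:3, 3]               # move into target frame
    proj = pts_t @ K.T                                 # project with K
    uv_t = proj[..., :2] / proj[..., 2:3]              # perspective divide
    return uv_t - pix[..., :2]

def dynamic_flow(total_flow, depth, K, T):
    """Residual flow left after removing the camera-induced part.

    Assumption: 'decoupling' is modelled as a plain subtraction.
    """
    return total_flow - static_flow(depth, K, T)
```

In this sketch the residual is close to zero wherever the scene is static and non-zero on independently moving objects, which is the region a module such as DMFM would need to warp. Errors in the depth, pose, or flow priors would show up in the residual as well, so a real system would need some form of masking or thresholding.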

Paper Structure

This paper contains 15 sections, 10 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Architecture of FlowDepth. The images $I_{t-1}$ and $I_t$ are first passed through the prior networks to obtain depth, camera-motion, and optical-flow priors. DMFM then uses these priors to decouple the moving objects. The resulting images are fed into a multi-frame depth estimation network, constrained by the cost-volume sparsity loss, which produces the depth estimates. Before the multi-frame depth reprojection loss is calculated, the images are blurred by the DCABlur module to mitigate high-frequency texture problems.
  • Figure 2: The detailed structure of DMFM. First, the static optical flow $F^s$ is obtained from the depth and pose priors. Then the overall optical flow $F^{all}_{t-1 \rightarrow t}$ is decoupled by $F^s_{t-1 \rightarrow t}$ to obtain the dynamic optical flow $F^d_{t-1 \rightarrow t}$. Using a learned mask network, $I_{t-1}$ is warped according to $F^d_{t-1 \rightarrow t}$ in the dynamic areas to obtain $I_{dec(t-1)}$. Meanwhile, directly applying $F^s_{t \rightarrow t-1}$ to $I_{t}$ yields $I_{dec(t)}$. In theory, $I_{dec(t-1)}$ and $I_{dec(t)}$ are identical.
  • Figure 3: Illustration of DCABlur. The Depth Cue is pre-trained as shown in the grey area.
  • Figure 4: Depth uncertainty caused by low-texture regions in the cost volume. When pixels in a low-texture region are projected onto the matching map according to the candidate depths, several nearly identical losses are obtained. Ideally, the probability of the feature cost over the candidate depths should follow a unimodal distribution, but the actual entropy is much larger than expected.
  • Figure 5: Examples of outputs of each part on KITTI.
  • ...and 2 more figures
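
The entropy argument in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the paper's cost-volume sparsity loss, whose formula is not given in this section. It only shows one plausible way to measure how far a per-pixel cost distribution is from unimodal: turn the costs over candidate depths into probabilities with a softmax and compute the entropy. The function name `depth_entropy`, the `(D, H, W)` layout, and the `temperature` parameter are assumptions.

```python
import numpy as np

def depth_entropy(cost_volume, temperature=1.0):
    """Per-pixel entropy (in nats) over candidate depths (illustrative).

    cost_volume: (D, H, W) matching cost per candidate depth,
                 where a lower cost means a better match (assumed).
    Returns an (H, W) map: near 0 for a sharp single peak,
    near log(D) when all candidates look equally good.
    """
    logits = -cost_volume / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=0, keepdims=True)                    # softmax over depth
    return -(p * np.log(p + 1e-12)).sum(axis=0)

# A textured pixel: one candidate depth clearly wins.
sharp = np.full((8, 1, 1), 50.0)
sharp[3] = 0.0
# A low-texture pixel: every candidate depth has the same cost.
flat = np.ones((8, 1, 1))

print(depth_entropy(sharp)[0, 0], depth_entropy(flat)[0, 0])
```

Under these assumptions, averaging such an entropy map would give a penalty that pushes the distribution over candidate depths toward a single peak, which matches the intuition in the caption. Whether the authors use entropy or another sparsity measure is not stated here.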