Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation

Jinfeng Liu, Lingtong Kong, Bo Li, Zerong Wang, Hong Gu, Jinwei Chen

TL;DR

Mono-ViFI is a unified self-supervised learning framework that bilaterally connects single- and multi-frame depth estimation. It incorporates spatial data augmentation through image affine transformation for data diversity, along with a triplet depth consistency loss for regularization.

Abstract

Self-supervised monocular depth estimation has attracted notable interest because it frees training from the dependency on depth annotations. In the monocular video training case, recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance. To tackle this, we synthesize more virtual camera views by flow-based video frame interpolation (VFI), termed temporal augmentation. For multi-frame inference, to sidestep the problem of dynamic objects encountered by explicit geometry-based methods like ManyDepth, we return to the feature fusion paradigm and design a VFI-assisted multi-frame fusion module that aligns and aggregates multi-frame features, using the motion and occlusion information obtained by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. In this framework, spatial data augmentation through image affine transformation is incorporated for data diversity, along with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments demonstrate that our method can bring significant improvements to current advanced architectures. Source code is available at https://github.com/LiuJF1226/Mono-ViFI.
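
The align-and-aggregate idea in the abstract can be illustrated with a minimal sketch. It assumes the common flow-based VFI convention in which each neighboring frame is backward-warped to the target time with an intermediate flow and the two warped results are blended by an occlusion mask. The function names (`backward_warp`, `merge_features`), the nearest-neighbor sampling, and the single-channel list-of-lists layout are hypothetical simplifications for illustration; the fusion module in the released code may be structured differently and likely includes learned layers.

```python
def backward_warp(feat, flow):
    """Sample `feat` at (x + dx, y + dy) for every target pixel (x, y).

    feat: H x W list of floats (one feature channel, for simplicity).
    flow: H x W list of (dx, dy) tuples pointing from the target frame
          to the source frame.
    Nearest-neighbour sampling; out-of-range coordinates clamp to the border.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            out[y][x] = feat[sy][sx]
    return out


def merge_features(feat_prev, feat_next, flow_to_prev, flow_to_next, mask):
    """Align both neighbours to the target frame, then blend them.

    mask: H x W list of weights in [0, 1]; 1 trusts the previous frame,
          0 trusts the next frame (an assumed convention).
    """
    warped_prev = backward_warp(feat_prev, flow_to_prev)
    warped_next = backward_warp(feat_next, flow_to_next)
    h, w = len(mask), len(mask[0])
    return [
        [
            mask[y][x] * warped_prev[y][x] + (1.0 - mask[y][x]) * warped_next[y][x]
            for x in range(w)
        ]
        for y in range(h)
    ]


# Toy 1 x 3 example: content moves one pixel to the right per frame.
feat_prev = [[1.0, 2.0, 3.0]]
feat_next = [[10.0, 20.0, 30.0]]
flow_to_prev = [[(-1, 0)] * 3]   # look one pixel to the left in the previous frame
flow_to_next = [[(1, 0)] * 3]    # look one pixel to the right in the next frame
mask = [[1.0, 0.5, 0.0]]
fused = merge_features(feat_prev, feat_next, flow_to_prev, flow_to_next, mask)
```

The mask lets each pixel draw on whichever neighbor actually observes it. This is one reason a flow-and-mask formulation can handle occlusions without an explicit geometric cost volume.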

Paper Structure (29 sections, 34 equations, 12 figures, 14 tables)

Figures (12)

  • Figure 1: Overview of Mono-ViFI. (a) We achieve temporal augmentation by a flow-based VFI model $\mathcal{H}$, which also reasons out intermediate optical flow and occlusion mask for multi-frame inference. (b) For target image $I_t$, we obtain its single-frame depth $D_t$ and multi-frame depth $D_t^m$. Also, $I_t$ is augmented to $\widetilde{I}_t$ through affine transformation $\mathcal{A}$. We then calculate the depth $\widetilde{D}_t$ of $\widetilde{I}_t$ and inversely convert it to the original view, generating $\widehat{D}_t$. Finally, we enforce a triplet depth consistency loss $L_{tc}(t)$ among the three depth maps, including a standard view depth consistency loss $L_{sv}(t)$ and two scale-aware depth consistency losses, $L_{sa}(t)$ and $L_{sa}^{m}(t)$. Note that each depth map also corresponds to a photometric loss and a smoothness loss, which are omitted here.
  • Figure 1: Affine Transformation.
  • Figure 2: Our VFI-assisted multi-frame fusion depth model. The intermediate optical flow and occlusion merge mask provided by the VFI network can also be used to align and aggregate multi-frame features in an explicit manner.
  • Figure 2: Qualitative ablation results for the Fourier positional encoding on KITTI. PE denotes positional encoding. Error maps in columns 2 and 4 show the Abs Rel error compared to the improved ground truth, from good (blue) to bad (red).
  • Figure 3: Qualitative results on KITTI. Error maps in columns 2 and 4 show the Abs Rel error compared to the improved ground truth, from good (blue) to bad (red).
  • ...and 7 more figures
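
The augmentation round trip described in the Figure 1 caption (augment $I_t$ with $\mathcal{A}$, predict $\widetilde{D}_t$, convert it back to the original view as $\widehat{D}_t$, then compare depths) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's loss: the helper names (`apply_affine`, `to_original_view`, `log_depth_gap`) are hypothetical, sampling is nearest-neighbor, and the mean absolute log-depth difference is only a stand-in for a consistency term. It does not model the depth scale change that the scale-aware losses $L_{sa}(t)$ and $L_{sa}^{m}(t)$ account for.

```python
import math


def apply_affine(m, t, x, y):
    """Map pixel (x, y) through p' = M p + t, with M a 2 x 2 matrix."""
    return (
        m[0][0] * x + m[0][1] * y + t[0],
        m[1][0] * x + m[1][1] * y + t[1],
    )


def to_original_view(depth_aug, m, t):
    """Convert a depth map of the augmented image back to the original grid.

    For each original pixel p, read the augmented depth at A(p).
    Returns (d_hat, valid), where valid marks pixels whose A(p) falls
    inside the augmented image.
    """
    h, w = len(depth_aug), len(depth_aug[0])
    d_hat = [[0.0] * w for _ in range(h)]
    valid = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ax, ay = apply_affine(m, t, x, y)
            sx, sy = int(round(ax)), int(round(ay))
            if 0 <= sx < w and 0 <= sy < h:
                d_hat[y][x] = depth_aug[sy][sx]
                valid[y][x] = True
    return d_hat, valid


def log_depth_gap(d_a, d_b, valid):
    """Mean absolute log-depth difference over valid pixels (stand-in metric)."""
    diffs = [
        abs(math.log(d_a[y][x]) - math.log(d_b[y][x]))
        for y in range(len(valid))
        for x in range(len(valid[0]))
        if valid[y][x]
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy 1 x 4 example: the augmentation shifts the image one pixel to the right.
identity = [[1.0, 0.0], [0.0, 1.0]]
shift = (1.0, 0.0)
depth = [[1.0, 2.0, 3.0, 4.0]]          # depth predicted on the original image
depth_aug = [[9.0, 1.0, 2.0, 3.0]]      # depth predicted on the shifted image
d_hat, valid = to_original_view(depth_aug, identity, shift)
gap = log_depth_gap(depth, d_hat, valid)
```

The valid mask matters because pixels pushed outside the augmented frame have no counterpart to compare against. In the toy example the last column is excluded and the remaining pixels agree exactly.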