DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi

TL;DR

A self-supervised method that jointly learns 3D motion and depth from monocular videos. It comprises a depth estimation module to predict depth and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.

Abstract

Although self-supervised depth estimation from monocular videos has advanced considerably, most existing methods treat all objects in a video as static entities. This assumption violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. The depth and motion networks work collaboratively to model the geometry and dynamics of real-world scenes faithfully, which in turn benefits both depth and 3D motion estimation. Their predictions are combined to synthesize a novel video frame for self-supervised training. As a core component of our framework, DO3D is a new motion disentanglement module that learns to predict camera ego-motion and instance-aware 3D object motion separately. To alleviate the difficulty of estimating non-rigid 3D object motions, we decompose them into object-wise 6-DoF global transformations and a pixel-wise local 3D motion deformation field. We conduct qualitative and quantitative experiments on three benchmark datasets (KITTI, Cityscapes, and VKITTI2), where our model delivers superior performance in all evaluated settings. For depth estimation, our model outperforms all compared works in the high-resolution setting, attaining an absolute relative depth error (abs rel) of 0.099 on the KITTI benchmark. Our optical flow estimation results (an overall EPE of 7.09 on KITTI) also surpass state-of-the-art methods and substantially improve estimation in dynamic regions, demonstrating the effectiveness of our motion model. Our code will be available.
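The two headline numbers in the abstract, abs rel and EPE, follow definitions that are standard in the depth and optical flow literature. The sketch below shows only those textbook definitions. The function names are hypothetical, and any cropping, depth capping, or scale alignment used in the paper's evaluation protocol is not described in this section and is therefore not modeled here.

```python
import math


def abs_rel(pred_depths, gt_depths):
    """Absolute relative depth error: mean of |d_pred - d_gt| / d_gt.

    Pixels without valid ground truth (gt <= 0) are skipped; this validity
    rule is an assumption of the sketch, not taken from the paper.
    """
    pairs = [(p, g) for p, g in zip(pred_depths, gt_depths) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def end_point_error(pred_flow, gt_flow):
    """EPE: mean Euclidean distance between predicted and true 2D flow vectors."""
    dists = [
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)
    ]
    return sum(dists) / len(dists)


# Toy values for illustration only (not data from the paper).
toy_abs_rel = abs_rel([9.0, 22.0], [10.0, 20.0])
toy_epe = end_point_error([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Lower is better for both metrics: abs rel is unitless, while EPE is measured in pixels.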

Paper Structure (25 sections, 21 equations, 14 figures, 12 tables)

Figures (14)

  • Figure 1: Visualization of 3D motion fields generated by GeoNet [yin2018geonet], Li et al. [li2020unsupervised], and our model. Our decomposed motion model predicts a more consistent 3D motion field than those obtained by optical flow or direct methods. R, G, B color maps correspond to motion in the x, y, z directions, respectively.
  • Figure 2: Depth-loss curves of the dynamic object (a) and the static object (b). $D_{gt}$ and $D_{min}$ are the points indicating the depth ground truth and the minimum photometric loss, respectively.
  • Figure 3: Warping process visualization. We visualize two important results of inverse warping at different depths. We set the smallest depth to $0.1\times d_{gt}$ and treat $5 \times d_{gt}$ as infinity. The forward warping mask is computed by Equation \ref{eq:inverse_warpping}, and the red mask marks the projected coordinates $\mathrm{u}_s$ and $\mathrm{v}_s$.
  • Figure 4: Visualization of four motion statuses. For simplicity, we assume that the ego-car is moving forward and only analyze situations where the target pixels are located in the right part of the image. Let $\mathrm{p}_t(\mathrm{u}_t,\mathrm{v}_t)$ be the pixel in the target frame and $\mathrm{p}_s$ be the projected pixel in the source frame, computed from the predicted depth $\mathrm{d}_t$ and the camera pose according to Equation \ref{eq:inverse_warpping}. $\mathrm{p}_s^\mathrm{gt}$ denotes the observed source pixel under different motion patterns of the object. For static scenes, which satisfy the assumptions of the self-supervised depth estimation model, $\mathrm{p}_s^\mathrm{gt}$ and $\mathrm{p}_s$ are the same point. In dynamic scenes (a)-(c), $\mathrm{p}_s$ is not at the same location as $\mathrm{p}_s^\mathrm{gt}$. The photometric loss nevertheless encourages the model to produce an estimated depth $\mathrm{d}_t$ that pushes $\mathrm{p}_s$ toward $\mathrm{p}_s^\mathrm{gt}$, which misleads the depth estimation optimization.
  • Figure 5: Model overview. Our system requires two consecutive video frames for camera ego-motion prediction (a). The reconstructed image $\mathrm{I}_t^\mathrm{ego}$ and the original image $\mathrm{I}_t$ are used as inputs to the dynamic rigid motion module (b), which learns the object-wise rigid motion $\mathrm{M}_{t\rightarrow s}^\mathrm{rig}$. Further, the residual non-rigid deformation module (c) exploits $\mathrm{I}_t^\mathrm{rig}$ and $\mathrm{I}_t$ to recover the non-rigid deformation $\mathrm{M}_{t\rightarrow s}^\mathrm{def}$. In the rigid motion estimation module, object-wise bounding boxes are obtained by Mask R-CNN [he2017mask]. We employ the RoIAlign operation [he2017mask] to generate object-wise features.
  • ...and 9 more figures
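
The captions of Figures 3 to 5 describe projecting a target pixel into the source frame using the predicted depth and camera pose, then refining the result with object-wise rigid motion and a residual deformation. The following is a minimal single-pixel sketch of that idea under a pinhole camera model. The paper's actual warping equation is not reproduced in this section, so the composition order (ego-motion, then object rigid motion, then additive deformation), all function names, and the toy intrinsics and motions are assumptions made for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def rigid_transform(point, rotation, translation):
    """Apply a 6-DoF transform: 3x3 rotation (nested lists), then translation."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u, v, depth, intrinsics, ego, obj=None, deform=(0.0, 0.0, 0.0)):
    """Project a target pixel into the source frame.

    Assumed composition (not confirmed by the source): ego-motion first,
    then the object's rigid motion if the pixel lies on a detected object,
    then a per-pixel additive 3D deformation.
    """
    fx, fy, cx, cy = intrinsics
    point = backproject(u, v, depth, fx, fy, cx, cy)
    point = rigid_transform(point, *ego)
    if obj is not None:
        point = rigid_transform(point, *obj)
    point = [c + d for c, d in zip(point, deform)]
    return project(point, fx, fy, cx, cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 50.0)      # hypothetical fx, fy, cx, cy
ego = (IDENTITY, [0.0, 0.0, 1.0])   # hypothetical ego-motion along the z axis
obj = (IDENTITY, [0.0, 0.0, -1.0])  # hypothetical object keeping pace with the camera

ego_only = warp_pixel(70.0, 50.0, 10.0, K, ego)       # static-scene assumption
with_obj = warp_pixel(70.0, 50.0, 10.0, K, ego, obj)  # object motion included
```

In this toy case the object moves with the camera, so `with_obj` stays at the original pixel while `ego_only` drifts toward the image center. That gap is the kind of mismatch between $\mathrm{p}_s$ and $\mathrm{p}_s^\mathrm{gt}$ that the Figure 4 caption says the photometric loss would otherwise try to close by distorting the depth.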