Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation

Genki Kinoshita, Ko Nishino

TL;DR

A novel training method that lets any monocular depth network learn absolute scale and estimate metric road-scene depth from regular training data alone, i.e., driving videos. It democratizes metric depth estimation by providing the means to convert any model into a metric depth estimator.

Abstract

In this paper, we introduce a novel training method for making any monocular depth network learn absolute scale and estimate metric road-scene depth just from regular training data, i.e., driving videos. We refer to this training framework as FUMET. The key idea is to leverage cars found on the road as sources of scale supervision and to incorporate them in network training robustly. FUMET detects cars in a frame, estimates their sizes, and aggregates the scale information extracted from them into an estimate of the camera height, whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any otherwise scale-oblivious monocular depth network, so that it becomes not only scale-aware but also metric-accurate without the need for auxiliary sensors or extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of FUMET, which achieves state-of-the-art accuracy. We also show that FUMET enables training on mixed datasets with different camera heights, which leads to larger-scale training and better generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and FUMET democratizes its deployment by establishing the means to convert any model into a metric depth estimator.
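
The idea of turning car sizes into a camera-height estimate can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a median to aggregate per-car ratios, and the sample numbers are all assumptions made for illustration. It only shows how a per-frame scale factor derived from cars could convert an unscaled camera height into a metric one.

```python
from statistics import median


def frame_scale(unscaled_car_heights, metric_car_heights):
    """Per-frame scale factor from paired car heights.

    unscaled_car_heights: car heights measured in the network's
        arbitrary depth scale (hypothetical inputs).
    metric_car_heights: estimated real-world heights of the same
        cars, in meters (hypothetical inputs).
    The median is an assumed robust aggregator, not the paper's rule.
    """
    ratios = [
        metric / unscaled
        for unscaled, metric in zip(unscaled_car_heights, metric_car_heights)
        if unscaled > 0
    ]
    if not ratios:
        return None  # no usable cars in this frame
    return median(ratios)


def scaled_camera_height(unscaled_camera_height, scale):
    """Convert an unscaled camera height to meters with the frame scale."""
    return scale * unscaled_camera_height


# Illustrative numbers only: three cars seen in one frame.
s = frame_scale([0.50, 0.48, 0.60], [1.50, 1.53, 1.74])
h = scaled_camera_height(0.55, s)
```

Frames without any detected car return `None` here, which is one reason a sequence-level quantity such as the camera height is a convenient carrier of scale: it can still be supervised in frames where no car is visible.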

Paper Structure (35 sections, 13 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Overview of FUMET. At each training step $n$, an unscaled camera height $H'^{n,\tau}_{\mathrm{cam}}$ is computed differentiably from the estimated depth. The previous epoch $\tau-1$ provides supervision in the form of a scaled camera height $H^{*\tau-1}_{\mathrm{cam}}$. To obtain this scaled camera height supervision, the Silhouette Projector first computes the object silhouette heights $H'_\mathrm{obj}$ from the depth map. By comparing $H'_\mathrm{obj}$ with the object heights $H_\mathrm{obj}$ estimated with LSP, a per-frame scale factor $s$ is determined, which gives the scaled camera height $H^{n,\tau}_\mathrm{cam} = s\cdot H'^{n,\tau}_\mathrm{cam}$. At the end of each epoch, $H^*_\mathrm{cam}$ is optimized across a series of consecutive frames and updated with a weighted moving average.
  • Figure 2: Qualitative comparison on KITTI. In the error maps, larger depth errors are shown in red and smaller ones in blue. The model trained with FUMET predicts more accurate depth maps than the weakly-supervised methods.
  • Figure 3: The proportion of frames by number of observed cars in the KITTI training dataset. Nearly half of the frames in the dataset contain no car or only one observed car.
  • Figure 4: Visualization of beam/instance occlusion and truncation augmentations for training LSP.
  • Figure 5: Additional qualitative comparison on KITTI. We compare the depth maps estimated by Monodepth2 trained with FUMET to those of the weakly-supervised methods G2S, DynaDepth, and VADepth. In the error maps, larger depth errors are shown in red and smaller ones in blue. The results show that FUMET predicts more accurate depth maps in various scenes than the weakly-supervised methods.
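
The epoch-level update mentioned in the Figure 1 caption, where $H^*_\mathrm{cam}$ is refreshed with a weighted moving average, can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the function name, the fixed `momentum` weight, and the sample values are assumptions, and the caption does not specify how the weights are actually chosen.

```python
def update_camera_height(previous, epoch_estimate, momentum=0.9):
    """Weighted moving average of the camera-height supervision.

    previous: camera height carried over from the last epoch, in
        meters, or None before the first epoch.
    epoch_estimate: camera height obtained from the current epoch.
    momentum: weight on the previous value (assumed, not from the paper).
    """
    if previous is None:
        return epoch_estimate  # nothing to average with yet
    return momentum * previous + (1.0 - momentum) * epoch_estimate


# Illustrative values only: noisy per-epoch estimates settle smoothly.
height = None
for estimate in [1.80, 1.60, 1.66, 1.64]:
    height = update_camera_height(height, estimate, momentum=0.5)
```

Averaging across epochs damps the influence of any single noisy estimate, which fits the premise in the title that the true camera height does not change over a sequence.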