MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications

Gasser Elazab, Torben Gräber, Michael Unterreiner, Olaf Hellwich

TL;DR

The paper addresses metric-scaled depth estimation from monocular video in automotive settings, where absolute scale matters for navigation and planning. It introduces MonoPP, a self-supervised framework in which a multi-frame teacher reconstructs the static scene with planar-parallax geometry and distills metric-scale information to a single-frame depth predictor; the only additional input is the camera's height above the ground. The method reports state-of-the-art self-supervised metric-scaled depth results on KITTI and is among the first to report self-supervised metric-scaled depth on Cityscapes, which suggests it transfers across datasets. The approach pairs a planar-parallax teacher with a monocular student, uses dedicated masks and losses to handle dynamic objects, and runs efficiently, which makes it practical for real-world vehicle perception.

Abstract

Self-supervised monocular depth estimation (MDE) has gained popularity for obtaining depth predictions directly from videos. However, these methods often produce scale-invariant results unless additional training signals are provided. Addressing this challenge, we introduce a novel self-supervised metric-scaled MDE model that requires only monocular video data and the camera's mounting position, both of which are readily available in modern vehicles. Our approach leverages planar-parallax geometry to reconstruct scene structure. The full pipeline consists of three main networks: a multi-frame network, a single-frame network, and a pose network. The multi-frame network processes sequential frames to estimate the structure of the static scene using planar-parallax geometry and the camera mounting position. Based on this reconstruction, it acts as a teacher, distilling knowledge such as scale information, a masked drivable area, metric-scale depth for the static scene, and a dynamic-object mask to the single-frame network. It also aids the pose network in predicting a metric-scaled relative pose between two subsequent images. Our method achieves state-of-the-art results for metric-scaled depth prediction on the KITTI driving benchmark. Notably, it is one of the first methods to produce self-supervised metric-scaled depth predictions for the challenging Cityscapes dataset, demonstrating its effectiveness and versatility.
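
To make the role of the camera mounting position concrete, the following is a minimal sketch of the classical plane-induced homography that planar-parallax methods build on. It is not the authors' implementation. The intrinsics `K`, the 1 m forward motion, the 1.65 m camera height, and the helper names `road_homography`, `project`, and `residual_flow` are all illustrative assumptions. The sketch aligns a source view to a target view using the road plane and measures the residual flow that remains for points above the road.

```python
import numpy as np


def road_homography(K, R, t, cam_height):
    """Homography induced by the road plane, mapping source pixels to target pixels.

    Assumed convention: camera axes are x-right, y-down, z-forward, so the road
    is the plane y = cam_height in the source frame, and points transform as
    X_target = R @ X_source + t.
    """
    n = np.array([0.0, 1.0, 0.0])  # road normal in the source camera frame
    return K @ (R + np.outer(t, n) / cam_height) @ np.linalg.inv(K)


def project(K, X):
    """Pinhole projection of a 3D point (camera frame) to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]


def residual_flow(K, R, t, cam_height, X_source):
    """Pixel offset left after warping the source pixel with the road homography."""
    H = road_homography(K, R, t, cam_height)
    p_s = np.append(project(K, X_source), 1.0)  # homogeneous source pixel
    p_w = H @ p_s
    p_w = p_w[:2] / p_w[2]                      # warped source pixel
    p_t = project(K, R @ X_source + t)          # true pixel in the target view
    return p_t - p_w


# Hypothetical setup: KITTI-like intrinsics, camera advanced 1 m along z.
K = np.array([[700.0, 0.0, 620.0],
              [0.0, 700.0, 188.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, -1.0])
cam_height = 1.65

# Points 10 m ahead and 2 m to the right, at increasing height above the road.
heights = [0.0, 0.5, 1.0, 1.5]
residuals = [
    np.linalg.norm(
        residual_flow(K, R, t, cam_height, np.array([2.0, cam_height - h, 10.0]))
    )
    for h in heights
]

for h, r in zip(heights, residuals):
    print(f"height above road {h:.1f} m -> residual flow {r:.2f} px")
```

In this toy configuration the residual vanishes for the point on the road and grows with the point's height above it. Because the homography depends on the metric camera height, a known mounting height ties the reconstruction to metric scale.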

Paper Structure

This paper contains 21 sections, 26 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: (A) Example of two sequential KITTI frames aligned by the planar road homography. The residual flow increases with the height of an object above the road. (B) Illustration of the epipolar structure of the residual flow between $p_t$ and $p^w_s$; the figure is inspired by RoadPlanarParallax.
  • Figure 2: Our framework is composed of two primary pipelines. The first pipeline performs monocular depth estimation by using a single image, $I_t$, as input. The second pipeline aims to reconstruct the geometry by determining the scale for the previously warped image, $I_{s}^{w}$. It then calculates the structure from this information and serves a dual purpose: it distills information to the monocular depth estimator to learn reliable depth about the static scenes, and it provides a mask to filter out dynamic objects. Regarding the colormap, brighter yellow means higher values, and vice versa. All images are cropped by the same ratio for better visualization.
  • Figure 3: Rendered 3D point cloud from MonoPP on a KITTI Eigen split test sample (unseen during training). Input image is shown at the bottom left.
  • Figure 4: Example from Cityscapes cordts2016cityscapes
  • Figure 5: Example from KITTI geiger2012kitti
  • ...and 14 more figures