
MAL: Motion-Aware Loss with Temporal and Distillation Hints for Self-Supervised Depth Estimation

Yue-Jiang Dong, Fang-Lue Zhang, Song-Hai Zhang

TL;DR

Motion-Aware Loss (MAL) is a plug-and-play module for multi-frame self-supervised monocular depth estimation methods. It associates the spatial locations of moving objects with the temporal order of input frames to eliminate errors induced by object motion.

Abstract

Depth perception is crucial for a wide range of robotic applications. Multi-frame self-supervised depth estimation methods have gained research interest due to their ability to leverage large-scale, unlabeled real-world data. However, these methods often assume a static scene, and their performance tends to degrade in dynamic environments. To address this issue, we present Motion-Aware Loss (MAL), which leverages the temporal relation among consecutive input frames and a novel distillation scheme between the teacher and student networks in multi-frame self-supervised depth estimation methods. Specifically, we associate the spatial locations of moving objects with the temporal order of input frames to eliminate errors induced by object motion. Meanwhile, we enhance the original distillation scheme in multi-frame methods to better exploit the knowledge from a teacher network. MAL is a novel, plug-and-play module designed for seamless integration into multi-frame self-supervised monocular depth estimation methods. Adding MAL to previous state-of-the-art methods reduces depth estimation errors by up to 4.2% and 10.8% on the KITTI and CityScapes benchmarks, respectively.

Paper Structure (19 sections, 10 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: Qualitative Demonstration of Our MAL's Effectiveness on the CityScapes (Cordts et al., 2016) Dataset. MAL is designed for multi-frame depth estimation methods (a, c). It is a plug-and-play module (b, d) aimed at improving depth perception (f), especially for moving objects, in dynamic scenes (e).
  • Figure 2: Framework of Multi-Frame Self-Supervised Depth Estimation. The three sub-networks (a-c) are trained concurrently with both image reprojection loss (e) and consistency loss (d). The dotted line indicates that the consistency loss does not propagate gradients to the teacher network.
  • Figure 3: Temporal Hints. By linking object positions to the temporal order of the input frames via a linear motion model, we align object positions (d-e) and significantly reduce motion-induced errors in the reconstructed image (f).
  • Figure 4: Qualitative Analysis of the Indispensability of Both the Temporal and Distillation Hints. Please refer to Section \ref{sec:complementary} for a detailed analysis.
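
The linear motion model mentioned in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation; it only shows the general property that, if an object moves with constant velocity across frames t-1, t, and t+1, its position at frame t is the midpoint of its positions in the two neighbouring frames. The function names, the pixel coordinates, and the velocity below are hypothetical values chosen for illustration.

```python
from typing import Tuple

Point = Tuple[float, float]


def linear_position(p_ref: Point, velocity: Point, dt: float) -> Point:
    """Position dt frames away from p_ref, assuming constant velocity (pixels per frame)."""
    return (p_ref[0] + velocity[0] * dt, p_ref[1] + velocity[1] * dt)


def midpoint_at_t(p_prev: Point, p_next: Point) -> Point:
    """Under linear motion, the position at frame t is the average of the
    positions at frames t-1 and t+1."""
    return ((p_prev[0] + p_next[0]) / 2.0, (p_prev[1] + p_next[1]) / 2.0)


# Hypothetical object: at pixel (100, 50) in frame t, moving 4 px right and
# 1 px down per frame. These numbers are assumptions, not data from the paper.
p_t = (100.0, 50.0)
v = (4.0, 1.0)

p_prev = linear_position(p_t, v, -1.0)  # position in frame t-1
p_next = linear_position(p_t, v, +1.0)  # position in frame t+1

# Recover the frame-t position from the two neighbouring frames only.
estimate = midpoint_at_t(p_prev, p_next)
print(estimate)
```

Under this assumption, the positions observed in the previous and next frames are displaced symmetrically around the frame-t position, which is why the temporal order of the input frames carries information about where a moving object should appear.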