On depth prediction for autonomous driving using self-supervised learning
Houssem Boulahbal
TL;DR
This work tackles depth prediction for autonomous driving using monocular self-supervised learning, addressing both immediate depth inference and future-depth forecasting. It introduces four key directions: (i) a contrario conditional GANs (cGANs) that enforce explicit conditionality for robust cross-domain depth and segmentation tasks; (ii) an instance-aware, transformer-based single-image-to-depth approach that handles dynamic objects by estimating a per-object 6-DOF pose; (iii) a video-to-depth forecasting framework that uses transformers to predict future depths while maintaining spatio-temporal consistency; and (iv) a video-to-video depth model that extends forecasting to sequences of depth maps with a spatio-temporal attention architecture. Collectively, these contributions advance self-supervised monocular depth estimation, dynamics handling, and future-depth prediction, with strong performance on KITTI and related benchmarks and clear applicability to autonomous driving (AD) and advanced driver assistance systems (ADAS). The work highlights the importance of explicit conditionality, dynamic-object modeling, and temporal coherence for reliable depth-based scene understanding in autonomous driving, and points to the potential for scalable, label-free perception systems in real-world driving contexts.
Abstract
Perception of the environment is a critical component of autonomous driving. It provides the vehicle with the ability to comprehend its surroundings and make informed decisions. Depth prediction plays a pivotal role in this process, as it supports the understanding of the geometry and motion of the environment. This thesis focuses on the challenge of depth prediction using monocular self-supervised learning techniques. The problem is first approached from a broader perspective by exploring conditional generative adversarial networks (cGANs) as a potential technique for achieving better generalization. In doing so, a fundamental contribution to conditional GANs, the a contrario cGAN, is proposed. The second contribution is a self-supervised single-image-to-depth method that addresses the rigid-scene assumption with a novel transformer-based model that outputs a pose for each dynamic object. The third contribution is a video-to-depth-map forecasting approach, which extends self-supervised techniques to predict future depths by means of a novel transformer model capable of predicting the future depth of a given scene. Finally, the limitations of the aforementioned methods are addressed and a video-to-video depth-map model is proposed. This model leverages the spatio-temporal consistency of the input and output sequences to predict a more accurate output depth sequence. These methods have significant applications in autonomous driving (AD) and advanced driver assistance systems (ADAS).
