On depth prediction for autonomous driving using self-supervised learning

Houssem Boulahbal

TL;DR

This work tackles depth prediction for autonomous driving using monocular self-supervised learning, addressing both immediate depth inference and future-depth forecasting. It introduces four key directions: (i) a contrario conditional GANs to enforce explicit conditionality for robust cross-domain depth and segmentation tasks; (ii) an instance-aware, transformer-based single-image-to-depth approach that handles dynamic objects by estimating per-object 6-DOF pose; (iii) a video-to-depth forecasting framework using transformers to predict future depths while maintaining spatio-temporal consistency; and (iv) a video-to-video depth model that extends forecasting to sequences of depth maps with a spatio-temporal attention architecture. Collectively, these contributions advance self-supervised monocular depth estimation, dynamics handling, and future-depth prediction, with strong performance on KITTI and related benchmarks and clear applicability to AD/ADAS systems. The work highlights the importance of explicit conditionality, dynamic-object modeling, and temporal coherence for reliable depth-based scene understanding in autonomous driving, and points to the potential for scalable, label-free perception systems in real-world driving contexts.

Abstract

Perception of the environment is a critical component for enabling autonomous driving. It provides the vehicle with the ability to comprehend its surroundings and make informed decisions. Depth prediction plays a pivotal role in this process, as it aids the understanding of the geometry and motion of the environment. This thesis focuses on the challenge of depth prediction using monocular self-supervised learning techniques. The problem is first approached from a broader perspective by exploring conditional generative adversarial networks (cGANs) as a potential technique to achieve better generalization. In doing so, a fundamental contribution to conditional GANs, the a contrario cGAN, was proposed. The second contribution is a single-image-to-depth self-supervised method that addresses the rigid-scene assumption with a novel transformer-based model that outputs a pose for each dynamic object. The third contribution is a video-to-depth-map forecasting approach, which extends self-supervised techniques to the prediction of future depths through a novel transformer model capable of predicting the future depth of a given scene. Finally, the limitations of the aforementioned methods were addressed and a video-to-video depth map model was proposed. This model leverages the spatio-temporal consistency of the input and output sequences to predict a more accurate output depth sequence. These methods have significant applications in autonomous driving (AD) and advanced driver assistance systems (ADAS).

Paper Structure

This paper contains 132 sections, 51 equations, 44 figures, and 15 tables.

Figures (44)

  • Figure 1: Illustration of steps of the cycle taken by an autonomous agent
  • Figure 2: An example of a possible mapping of a dataset that contains two classes using two features
  • Figure 3: Taxonomy of deep learning methods for depth prediction
  • Figure 4: Training-time versus test-time for self-supervised depth prediction. The red arrow shows the test-time pipeline, while the blue arrows show the training-time pipeline. The warping function and the pose network are used only during training.
  • Figure 5: The Pix2Pix architecture with encoder-decoder structure and skip connections, inspired by the U-Net architecture. The encoder captures essential features from the source image, while the decoder generates the corresponding output image in the target domain. The inclusion of skip connections helps preserve details, leading to high-quality outputs. Figure from Isola et al. (2017).
  • ...and 39 more figures
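
The training-time pipeline described in the Figure 4 caption relies on a warping function driven by predicted depth and camera pose. The sketch below illustrates the geometric core of such a warp under common assumptions (a pinhole camera, a rigid scene, and a known 4x4 relative pose). It is not the thesis implementation: the function name `reproject_pixels` and all parameter names are hypothetical, and a full pipeline would additionally sample the source image at the returned coordinates and compare the result with the target image through a photometric loss.

```python
import numpy as np

def reproject_pixels(depth, K, T_target_to_source):
    """Map every target pixel to its location in a source view.

    Hypothetical helper for illustration only.
    depth: (H, W) depth predicted for the target image.
    K: (3, 3) pinhole camera intrinsics (assumed shared by both views).
    T_target_to_source: (4, 4) rigid transform, target camera to source camera.
    Returns an (H, W, 2) array of (u, v) coordinates in the source image.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid, each (H, W)
    ones = np.ones_like(u)
    pix = np.stack([u, v, ones], axis=0).reshape(3, -1).astype(np.float64)

    # Back-project pixels to 3D points in the target camera frame.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)

    # Move the points into the source camera frame.
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src = (T_target_to_source @ cam_h)[:3]

    # Project onto the source image plane (guarding against division by zero).
    proj = K @ src
    uv = proj[:2] / np.clip(proj[2:3], 1e-6, None)
    return uv.T.reshape(H, W, 2)
```

At test time none of this is needed: only the depth network runs, which matches the caption's statement that the warping function and the pose network are used only during training. The rigid-scene assumption is visible in the single transform applied to all pixels; handling dynamic objects, as in the thesis's second contribution, would require a separate pose per object.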