DINO Pre-training for Vision-based End-to-end Autonomous Driving

Shubham Juneja, Povilas Daniušis, Virginijus Marcinkevičius

TL;DR

The paper addresses pre-training for vision-based end-to-end autonomous driving in an imitation learning setting, replacing the usual supervised ImageNet pre-training with DINO self-supervised pre-training. A DINO-pretrained vision encoder is integrated into a Roach/CILRS-inspired driving architecture, trained via DAgger in the CARLA Leaderboard setting, and benchmarked against a supervised baseline and VPRPre. Results indicate that DINO pre-training improves generalisation to unseen towns and weather, performing comparably to VPRPre in unfamiliar conditions while converging faster and overfitting less in familiar settings. The work supports the viability of self-supervised pre-training for robust autonomous driving, highlighting its efficiency and its potential for broader domain transfer without reliance on labelled data.
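The DAgger training mentioned above can be illustrated with a generic loop. This is a minimal sketch of the textbook algorithm, not the paper's training code: the callables `expert_action`, `rollout` and `fit`, and the geometric `beta_decay` mixing schedule, are hypothetical placeholders introduced here for illustration.

```python
import random


def dagger(expert_action, rollout, fit, iterations=3, beta_decay=0.5, seed=0):
    """Run a simplified DAgger loop and return (policy, dataset).

    expert_action(obs) -> action : expert label for an observation (hypothetical)
    rollout(policy)    -> [obs]  : observations visited while acting with `policy`
    fit(dataset)       -> policy : supervised learner over (obs, action) pairs
    """
    rng = random.Random(seed)
    dataset = []
    policy = expert_action  # with beta = 1 in iteration 0, only the expert acts
    for i in range(iterations):
        beta = beta_decay ** i  # assumed schedule: probability of deferring to the expert

        def mixed_policy(obs, beta=beta, learner=policy):
            # Act with the expert with probability beta, otherwise with the learner.
            if rng.random() < beta:
                return expert_action(obs)
            return learner(obs)

        visited = rollout(mixed_policy)
        # Every visited observation is labelled by the expert, whoever was acting.
        dataset.extend((obs, expert_action(obs)) for obs in visited)
        # Retrain on the aggregated dataset, not only on the newest rollout.
        policy = fit(dataset)
    return policy, dataset
```

The point of the aggregation step is that the learner is trained on states reached under its own behaviour, which plain behaviour cloning never sees.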

Abstract

In this article, we focus on the pre-training of visual autonomous driving agents in the context of imitation learning. Current methods often rely on classification-based pre-training, which we hypothesise holds back the agents' capacity for implicit image understanding. We propose pre-training the visual encoder of a driving agent using the self-distillation with no labels (DINO) method, which relies on a self-supervised learning paradigm. Our experiments in the CARLA environment, in accordance with the Leaderboard benchmark, reveal that the proposed pre-training is more efficient than classification-based pre-training, and is on par with the recently proposed pre-training based on visual place recognition (VPRPre).
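For readers unfamiliar with DINO, its self-distillation objective can be sketched in a few lines. The code below is a dependency-free illustration for a single pair of views, assuming the student and teacher networks have already produced logit vectors. The temperatures and momentum values are the defaults commonly cited for DINO and are assumptions here; the settings used in this paper may differ.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student output distributions.

    The teacher output is centred, then sharpened with a low temperature,
    and is treated as a fixed target (no gradient flows through it).
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_batch, momentum=0.9):
    """Moving average of the batch mean of teacher logits; helps avoid collapse."""
    batch_mean = [sum(col) / len(teacher_batch) for col in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * m
            for c, m in zip(center, batch_mean)]
```

No labels appear anywhere in the objective: the only training signal is agreement between the student's and teacher's outputs on different views of the same image, which is why this pre-training needs no annotated data.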

Paper Structure (11 sections, 9 equations, 2 figures, 4 tables)

This paper contains 11 sections, 9 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Mean route completion (%) of the evaluated agents, averaged over three seeds, on the offline Leaderboard benchmark under training conditions (left) and testing conditions (right).
  • Figure 2: Mean distance completion (%) of the evaluated agents, averaged over three seeds, on the offline Leaderboard benchmark under training conditions (left) and testing conditions (right).