End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation

Mingzhe Guo, Zhipeng Zhang, Yuan He, Ke Wang, Liping Jing

TL;DR

The paper tackles the annotation and computation bottlenecks of end-to-end autonomous driving by proposing UAD, an unsupervised E2EAD framework. It introduces an Angular Perception Pretext to model spatial objectness and temporal dynamics without 3D labels, and a self-supervised Direction-Aware Planning mechanism that enforces trajectory consistency across augmented views. Empirical results show state-of-the-art open-loop performance on nuScenes and substantially improved closed-loop driving in CARLA Town05 Long, along with major efficiency gains (training budget and inference speed) over prior methods like UniAD and VAD. The approach also demonstrates compatibility with optional 3D heads for safety checks, highlighting practical applicability in real-world systems.
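
For intuition on the label-free perception pretext, here is a minimal sketch (in PyTorch) of what angular-wise objectness prediction could look like, assuming the BEV plane is partitioned into K angular sectors around the ego vehicle and per-sector pseudo-labels are available without 3D boxes. The class name, tensor shapes, and the pseudo-label input are hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch only: per-sector objectness for an angular perception pretext.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AngularObjectnessHead(nn.Module):
    """Predicts one objectness logit per angular BEV sector (hypothetical module)."""

    def __init__(self, bev_channels: int = 256, num_sectors: int = 36):
        super().__init__()
        self.num_sectors = num_sectors
        self.proj = nn.Linear(bev_channels, 1)

    def forward(self, sector_feats: torch.Tensor) -> torch.Tensor:
        # sector_feats: (B, K, C) features pooled inside each angular sector.
        return self.proj(sector_feats).squeeze(-1)  # (B, K) objectness logits


def angular_pretext_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    # Binary supervision per sector, derived without 3D box annotation.
    return F.binary_cross_entropy_with_logits(logits, pseudo_labels.float())
```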

Abstract

We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD) that achieves the best open-loop evaluation performance on nuScenes while showing robust closed-loop driving quality in CARLA. Our motivation stems from the observation that current E2EAD models still mimic the modular architecture of typical driving stacks, with carefully designed supervised perception and prediction subtasks providing environment information for the downstream planning. Although this design has achieved groundbreaking progress, it has certain drawbacks: 1) the preceding subtasks require massive high-quality 3D annotations as supervision, posing a significant impediment to scaling up the training data; 2) each submodule entails substantial computation overhead in both training and inference. To address these issues, UAD is built around an unsupervised proxy. First, we design a novel Angular Perception Pretext that eliminates the annotation requirement: it models the driving scene by predicting angular-wise spatial objectness and temporal dynamics, without manual annotation. Second, we propose a self-supervised training strategy that learns the consistency of predicted trajectories under different augmented views, enhancing planning robustness in steering scenarios. UAD achieves a 38.7% relative improvement over UniAD in average collision rate on nuScenes and surpasses VAD by 41.32 points in driving score on CARLA's Town05 Long benchmark. Moreover, the proposed method consumes only 44.3% of UniAD's training resources and runs 3.4 times faster at inference. Our design not only demonstrates, for the first time, clear performance advantages over supervised counterparts, but also offers unprecedented efficiency in data, training, and inference. Code and models will be released at https://github.com/KargoBot_Research/UAD.
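
The self-supervised Direction-Aware Planning idea, consistency of trajectories across augmented views, can be pictured as a rotation-consistency objective. The sketch below assumes the planner outputs (B, T, 2) waypoints in ego coordinates and that a rotation-based view augmentation is used; the function names and the `rotate_bev_feature` helper passed in by the caller are hypothetical, so this illustrates the consistency objective rather than the authors' exact formulation.

```python
# Illustrative sketch only: trajectory consistency under rotated (augmented) views.
import torch
import torch.nn.functional as F


def rotate_points(points: torch.Tensor, angle_rad: torch.Tensor) -> torch.Tensor:
    """Rotate (B, T, 2) ego-frame waypoints by per-sample angles (B,)."""
    cos, sin = torch.cos(angle_rad), torch.sin(angle_rad)
    rot = torch.stack(
        [torch.stack([cos, -sin], dim=-1),
         torch.stack([sin, cos], dim=-1)],
        dim=-2,
    )  # (B, 2, 2) rotation matrices
    return torch.einsum("bij,btj->bti", rot, points)


def consistency_loss(planner, bev_feat, rotate_bev_feature, angle_rad):
    """Planning on a rotated view should match the original plan, rotated likewise.

    `planner` maps BEV features to (B, T, 2) waypoints; `rotate_bev_feature`
    is a hypothetical augmentation helper supplied by the caller.
    """
    traj = planner(bev_feat)                          # (B, T, 2) original plan
    traj_aug = planner(rotate_bev_feature(bev_feat, angle_rad))
    target = rotate_points(traj, angle_rad).detach()  # stop-grad on one branch
    return F.l1_loss(traj_aug, target)
```

In this reading, the augmented branch receives no manual labels at all: its only training signal is agreement with the (rotated) prediction from the original view, which is what makes the strategy self-supervised.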

Paper Structure

This paper contains 28 sections, 9 equations, 11 figures, and 15 tables.

Figures (11)

  • Figure 1: (a) End-to-end autonomous driving paradigms. 1) The vanilla architecture that directly predicts control commands. 2) The modularized design that combines various preceding tasks. 3) Our proposed framework with an unsupervised pretext task. (b) Comparison of training cost, inference speed, and average $\rm L2$ error between our method and UniAD/VAD on 8 NVIDIA Tesla A100 GPUs.
  • Figure 2: The architecture of our UAD. The inference pipeline is marked by black arrows with a blue background, which plans the ego trajectory from the input multi-view images. The training pipeline consists of Angular Perception Pretext (orange arrows with a khaki background) and Direction-Aware Planning (orange arrows with a purple background). "F" in the BEV feature indicates the driving direction.
  • Figure 3: (a) Label generation for angular perception pretext. (b) Illustration of dreaming decoder.
  • Figure 4: Illustration of direction-aware learning strategy.
  • Figure 5: (a) Qualitative results in nuScenes. (b) Qualitative results in CARLA.
  • ...and 6 more figures