Navigation-Guided Sparse Scene Representation for End-to-End Autonomous Driving

Peidong Li; Dixiao Cui

TL;DR

SSR addresses the annotation and deployment bottlenecks of perception-heavy E2EAD by learning a sparse, navigation-guided scene representation of only 16 tokens. A Future Feature Predictor provides self-supervised temporal alignment, enabling end-to-end planning without explicit perception supervision. On nuScenes, SSR achieves state-of-the-art open-loop performance, reducing $L_2$ error by 27.2% and collision rate by 51.6% relative to UniAD, with 10.9$\times$ faster inference and 13$\times$ faster training. It also outperforms VAD-Base by 48.6 driving-score points on CARLA Town05 Long. The sparse tokens and their attention visualizations make the representation interpretable, and the efficiency gains suggest strong potential for scalable deployment in real-world autonomous driving systems.

Abstract

End-to-End Autonomous Driving (E2EAD) methods typically rely on supervised perception tasks to extract explicit scene information (e.g., objects, maps). This reliance necessitates expensive annotations and constrains deployment and data scalability in real-time applications. In this paper, we introduce SSR, a novel framework that uses only 16 navigation-guided tokens as a Sparse Scene Representation, efficiently extracting crucial scene information for E2EAD. Our method eliminates the need for human-designed supervised sub-tasks, allowing computational resources to concentrate on the elements directly related to navigation intent. We further introduce a temporal enhancement module that aligns predicted future scenes with actual future scenes through self-supervision. SSR achieves a 27.2\% relative reduction in L2 error and a 51.6\% relative reduction in collision rate compared with UniAD on nuScenes, with 10.9$\times$ faster inference and 13$\times$ faster training. Moreover, SSR outperforms VAD-Base by 48.6 points in driving score on CARLA's Town05 Long benchmark. This framework represents a significant leap in real-time autonomous driving systems and paves the way for future scalable deployment. Code is available at https://github.com/PeidongLi/SSR.
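To make the idea of compressing a dense scene feature into 16 navigation-guided tokens more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a TokenLearner-style scheme in which each token is a spatially weighted pool of the BEV feature, and it assumes the navigation command modulates the feature channel-wise before the attention maps are computed. The function name `navigation_guided_tokens`, the parameters `nav_embed` and `w_attn`, and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def navigation_guided_tokens(bev, nav_embed, w_attn):
    """Compress a dense BEV feature into K sparse scene tokens.

    bev:       (N, C) flattened BEV grid features, N = H * W
    nav_embed: (C,)   embedding of the navigation command (hypothetical)
    w_attn:    (C, K) projection producing one spatial attention map per token
    returns:   tokens of shape (K, C) and attention maps of shape (N, K)
    """
    # Assumed form of guidance: channel-wise modulation by the command.
    guided = bev * (1.0 + nav_embed)
    logits = guided @ w_attn            # (N, K): one score per cell and token
    attn = softmax(logits, axis=0)      # normalize over spatial positions
    tokens = attn.T @ bev               # (K, C): attention-weighted pooling
    return tokens, attn

rng = np.random.default_rng(0)
H, W, C, K = 20, 20, 32, 16             # toy sizes; K = 16 tokens as in the abstract
bev = rng.normal(size=(H * W, C))
nav_embed = 0.1 * rng.normal(size=C)
w_attn = 0.1 * rng.normal(size=(C, K))
tokens, attn = navigation_guided_tokens(bev, nav_embed, w_attn)
```

In this sketch the planner would attend over only `K` tokens instead of all `H * W` grid cells, which illustrates where the efficiency gain could come from.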

Paper Structure (27 sections, 14 equations, 13 figures, 8 tables)

Introduction
Related Work
Vision-Based End-to-End Autonomous Driving
Scene Representation in Autonomous Driving
Method
Overview
Navigation-Guided Scenes Token Learner
Planning based on sparse scene representation
Temporal Enhancement by Future Feature Predictor
Experiments
Dataset and Metric
Implementation Details
Main Result
Ablation Study
Component-wise ablation
...and 12 more sections

Figures (13)

Figure 1: Performance Comparison of Various Methods in Speed and Accuracy on nuScenes.
Figure 2: Comparison of Various End-to-End Paradigms. Compared to previous task-specific supervised paradigms, our adaptive unsupervised approach takes full advantage of the end-to-end framework by using navigation-guided perception, without the need to differentiate between sub-tasks.
Figure 3: Overview of SSR: SSR consists of two parts: the purple part, which is used during both training and inference, and the gray part, which is only used during training. In the purple part, the dense BEV feature is first compressed by the Scenes TokenLearner into sparse queries, which are then used for planning via cross-attention. In the gray part, the predicted BEV feature is obtained from the Future Feature Predictor. The future BEV feature is then used to supervise the predicted BEV feature, enhancing both the scene representation and the planning decoder.
Figure 4: Structure of Modules: Scenes TokenLearner and Future Feature Predictor.
Figure 5: Visualization of BEV Square Attention Map of Scene Queries. Attention maps for 8 of the 16 tokens are displayed. The ego vehicle is located at the center, and the upward direction corresponds to the front of the ego vehicle. Brighter areas represent higher attention weights. The full set is provided in the Appendix.
...and 8 more figures
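The training-only branch described in the Figure 3 caption, where an actual future feature supervises a predicted one, can be illustrated with a small NumPy sketch. This is a toy under stated assumptions rather than the paper's Future Feature Predictor: the predictor here is a plain linear map conditioned on a planned trajectory, the comparison is a mean squared error, and the names `predict_future_feature`, `feature_alignment_loss`, `w_motion`, and `w_mix` are hypothetical.

```python
import numpy as np

def predict_future_feature(tokens, traj, w_motion, w_mix):
    """Toy predictor: shift scene tokens by a trajectory embedding, then mix.

    tokens:   (K, C) current sparse scene tokens
    traj:     (T, 2) planned waypoints (x, y)
    w_motion: (2 * T, C) maps the flattened trajectory to feature space
    w_mix:    (C, C) linear mixing of channels
    """
    motion = traj.reshape(-1) @ w_motion      # (C,) trajectory embedding
    return (tokens + motion) @ w_mix          # (K, C) predicted future feature

def feature_alignment_loss(pred, target):
    # Mean squared error between predicted and actual future features.
    # In training, the target would be treated as a constant (no gradient).
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
K, C, T = 16, 32, 6                           # toy sizes
tokens = rng.normal(size=(K, C))
traj = rng.normal(size=(T, 2))
w_motion = 0.1 * rng.normal(size=(2 * T, C))
w_mix = 0.1 * rng.normal(size=(C, C))

pred = predict_future_feature(tokens, traj, w_motion, w_mix)
future = rng.normal(size=(K, C))              # stand-in for the real future feature
loss = feature_alignment_loss(pred, future)
```

Because the target comes from the next frame of the same driving log, this kind of loss needs no human labels, which is the sense in which the caption's supervision is self-supervised.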
