
What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, Bernt Schiele

Abstract

End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird's-eye-view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also on intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop evaluation often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations and underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves a 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: https://dmholtz.github.io/bevad/

Paper Structure (26 sections, 7 equations, 15 figures, 9 tables)

This paper contains 26 sections, 7 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Architectural Patterns. (a) High-resolution BEV features facilitate perception tasks, but promote overfitting the planner. (b) Closed-loop methods prefer path over trajectory representations due to robust steering. (c) Point-estimates interpolate between trajectory modes that diffusion-based sampling can breed.
  • Figure 2: Analysis Framework. (a) We build our analysis framework on ParaDrive [PARADriveParallelized2024]. (b) We introduce a scene tokenizer to reduce the spatial resolution of the BEV features. The design of our planning head is based on a diffusion transformer [ScalableDiffusionModels2023]. Crucially, the choice of the planning queries determines whether the planner is modeled as a point estimator or by diffusion.
  • Figure 3: Qualitative visualization of the planning queries' cross-attention to BEV features. \ref{fig:tok1-full}: The planner attends to distant BEV cells. Despite strong attention on the traffic light, the autonomous vehicle runs the red light. \ref{fig:tok1-masked}: There are numerous attention spikes to random BEV cells, but barely any attention to the oncoming traffic. \ref{fig:tok4-masked}: The attention map simplifies significantly and exhibits fewer attention outliers.
  • Figure 4: Scaling Properties. Diffusion demonstrates superior performance over point estimators when scaled with sufficient training data, despite initially underperforming with limited data.
  • Figure 5: Yield to Emergency Vehicle. By increasing the training dataset size, BevAD-M learns to yield to emergency vehicles (red) on highways by safely merging into slower traffic. This capability is absent at smaller data scales (BevAD-S) and in prior leading closed-loop methods [SimLingoVisionOnly2025, HiPADHierarchical2025].
  • ...and 10 more figures
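
The caption of Figure 2(b) mentions a scene tokenizer that reduces the spatial resolution of the BEV features. The captions do not say how that tokenizer is built, so the following is only a minimal sketch of the general idea, assuming non-overlapping average pooling over square patches. The function name `tokenize_bev`, the `patch` parameter, and the grid sizes are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def tokenize_bev(bev: np.ndarray, patch: int) -> np.ndarray:
    """Average-pool an (H, W, C) BEV grid into (H//patch * W//patch, C) tokens.

    Assumption: plain average pooling stands in for whatever tokenizer the
    paper actually uses; this only illustrates the resolution reduction.
    """
    h, w, c = bev.shape
    if h % patch or w % patch:
        raise ValueError("patch must divide both BEV dimensions")
    # Split each spatial axis into (blocks, patch) and average inside each patch.
    pooled = bev.reshape(h // patch, patch, w // patch, patch, c).mean(axis=(1, 3))
    # Flatten the coarse grid into a token sequence for a transformer planner.
    return pooled.reshape(-1, c)


# Hypothetical 8x8 BEV grid with 16 feature channels.
bev = np.random.default_rng(0).normal(size=(8, 8, 16))
tokens_full = tokenize_bev(bev, patch=1)    # one token per BEV cell
tokens_coarse = tokenize_bev(bev, patch=4)  # one token per 4x4 patch
print(tokens_full.shape, tokens_coarse.shape)
```

With a patch size of 4, the planner in this sketch would cross-attend to 16 times fewer scene tokens than with a patch size of 1. That is the kind of reduction the caption describes, although the actual mechanism in the paper may differ.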