iPad: Iterative Proposal-centric End-to-End Autonomous Driving

Ke Guo, Haochen Liu, Xiaojun Wu, Jia Pan, Chen Lv

TL;DR

iPad is an iterative, proposal-centric end-to-end driving framework that builds planning around a sparse set of BEV proposals instead of a dense BEV grid. Its BEV encoder, ProFormer, refines the proposals and their features through proposal-anchored attention, and two lightweight auxiliary tasks (mapping and prediction) keep the learned features relevant to planning at little extra computational cost. A scoring module selects the best proposal, and training optimizes a joint loss that ties perception and representation learning directly to planning quality. Results on NAVSIM and Bench2Drive show state-of-the-art performance and substantially better efficiency than prior dense BEV-based methods, along with favorable scaling behavior and more interpretable outputs.

Abstract

End-to-end (E2E) autonomous driving systems offer a promising alternative to traditional modular pipelines by reducing information loss and error accumulation, with significant potential to enhance both mobility and safety. However, most existing E2E approaches directly generate plans based on dense bird's-eye view (BEV) grid features, leading to inefficiency and limited planning awareness. To address these limitations, we propose iterative Proposal-centric autonomous driving (iPad), a novel framework that places proposals - a set of candidate future plans - at the center of feature extraction and auxiliary tasks. Central to iPad is ProFormer, a BEV encoder that iteratively refines proposals and their associated features through proposal-anchored attention, effectively fusing multi-view image data. Additionally, we introduce two lightweight, proposal-centric auxiliary tasks - mapping and prediction - that improve planning quality with minimal computational overhead. Extensive experiments on the NAVSIM and CARLA Bench2Drive benchmarks demonstrate that iPad achieves state-of-the-art performance while being significantly more efficient than prior leading methods.
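
The refine-then-score loop described above can be illustrated with a toy sketch. This is not the authors' implementation: ProFormer and the scorer are learned networks operating on image features, whereas here `refine_step` and `score` are hypothetical hand-written stand-ins (a pull toward a straight line to an assumed `goal` point), and all names are invented for illustration. The sketch only shows the control flow: refine every proposal for a fixed number of iterations, score each one, and return the highest-scoring proposal.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Proposal = List[Point]  # one candidate future plan, as BEV waypoints


def straight_line(goal: Point, steps: int) -> Proposal:
    """Evenly spaced waypoints from the origin (ego position) to `goal`."""
    return [(goal[0] * (i + 1) / steps, goal[1] * (i + 1) / steps)
            for i in range(steps)]


def refine_step(proposal: Proposal, goal: Point, rate: float = 0.5) -> Proposal:
    """Hypothetical stand-in for one learned refinement iteration:
    move each waypoint a fraction `rate` toward the straight-line target."""
    target = straight_line(goal, len(proposal))
    return [(x + rate * (tx - x), y + rate * (ty - y))
            for (x, y), (tx, ty) in zip(proposal, target)]


def score(proposal: Proposal, goal: Point) -> float:
    """Hypothetical stand-in for a learned scorer: negative squared
    deviation from the straight-line target (higher is better)."""
    target = straight_line(goal, len(proposal))
    return -sum((x - tx) ** 2 + (y - ty) ** 2
                for (x, y), (tx, ty) in zip(proposal, target))


def plan(proposals: List[Proposal], goal: Point,
         iterations: int = 3) -> Tuple[Proposal, List[float]]:
    """Iteratively refine all proposals, score them, and pick the best."""
    for _ in range(iterations):
        proposals = [refine_step(p, goal) for p in proposals]
    scores = [score(p, goal) for p in proposals]
    best = max(range(len(proposals)), key=scores.__getitem__)
    return proposals[best], scores


if __name__ == "__main__":
    candidates = [
        [(0.0, 0.0), (0.0, 0.0)],  # poor initial proposal
        [(1.0, 0.0), (2.0, 0.0)],  # already on the straight line
    ]
    best, scores = plan(candidates, goal=(2.0, 0.0))
    print(best, scores)
```

In this toy setting each iteration halves a proposal's remaining deviation, which mirrors the idea that additional refinement iterations improve the candidates before the scorer ranks them.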

Paper Structure

This paper contains 26 sections, 10 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Comparison of end-to-end paradigms. (a) Dense one-shot, grid-centric methods generate BEV features for every cell and directly output the final plan based on the extracted dense BEV grid features. (b) iPad iteratively refines sparse BEV proposals and their queries, concentrating feature extraction on the regions most relevant to planning by using the proposal corner points as anchors.
  • Figure 2: Overview of the iPad framework, consisting of four key components: the Scene Encoder (gray) extracts image and ego features; the ProFormer (blue) initializes BEV proposal queries with ego features and iteratively refines them using the image features; the Scorer (green) predicts a score for each proposal trajectory; and the Proposal-Centric Mapping and Prediction modules (red) predict passability maps and agent future states related to potential collisions.
  • Figure 3: Scaling law in iPad. The PDM score on the NAVSIM benchmark increases logarithmically with the number of proposals, the number of iterations, and the training data size.
  • Figure 4: Qualitative planning and collision prediction results on NAVSIM and Bench2Drive. Proposal lines are shaded with brightness proportional to their predicted scores, while the brightness of predicted agent boxes reflects their associated proposals.
  • Figure 5: Detailed architecture of ProFormer. The proposals are used to query deformable proposal-centric image features ${\bm{I}}$ (yellow) to update the proposal features.
  • ...and 7 more figures
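
One way to read the "logarithmic" trend in the Figure 3 caption is as a fit of the following form. The symbols $a$, $b$ and $n$ are assumptions introduced here for illustration and are not taken from the paper; $n$ stands for whichever quantity is being scaled (number of proposals, number of iterations, or training data size).

$$
\mathrm{PDMS}(n) \approx a + b \ln n, \qquad b > 0
$$

Under such a fit, doubling $n$ adds a constant amount to the score regardless of the starting point:

$$
\mathrm{PDMS}(2n) - \mathrm{PDMS}(n) \approx b \ln 2
$$

so each further doubling of proposals, iterations, or data would buy the same absolute gain, meaning diminishing returns per added unit.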