Active Learning from Scene Embeddings for End-to-End Autonomous Driving
Wenhao Jiang, Duo Li, Menghan Hu, Chao Ma, Ke Wang, Zhipeng Zhang
TL;DR
SEAD tackles labeling bottlenecks in end-to-end autonomous driving (E2E-AD) by introducing an active learning framework driven by BEV (Bird's Eye View) features. It builds a diverse initial dataset from static and dynamic scene cues, then iteratively selects valuable scenes and consecutive key frames using BEV feature shifts and predefined thresholds, all within a formal objective $G = \max (P/B)$. On nuScenes with a lightweight VAD-Tiny backbone, SEAD achieves planning performance close to that of full-dataset training using only a 30% labeling budget, and it outperforms several baselines on open-loop planning metrics. Ablation studies and visualizations confirm the effectiveness and robustness of each module, highlighting the practical data-efficiency gains of BEV-centric data valuation for E2E-AD across models and datasets.
Abstract
In autonomous driving, end-to-end deep learning models show great potential by learning driving decisions directly from sensor data. However, training these models requires large amounts of labeled data, which is time-consuming and expensive to collect. Real-world driving data follows a long-tailed distribution in which simple scenarios make up the majority, which motivates us to identify the most challenging scenarios within it. Training on this highest-value subset then improves model performance efficiently. Prior research has selected valuable data with empirically designed strategies, but such manually designed methods generalize poorly to new data distributions. Observing that the BEV (Bird's Eye View) features in end-to-end models contain all the information required to represent the scenario, we propose SEAD, an active learning framework that relies on these vectorized scene-level features. The framework selects initial data based on driving-environmental information and incremental data based on BEV features. Experiments show that only 30% of the nuScenes training data is needed to achieve performance close to that obtained with the full dataset. The source code will be released.
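The incremental selection step based on BEV features can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-distance shift measure, the threshold value, and the names `bev_shift` and `select_key_frames` are all assumptions made for illustration. The sketch only shows the general idea of keeping frames whose BEV features change sharply from the previous frame, subject to a labeling budget.

```python
import numpy as np


def bev_shift(prev_feat: np.ndarray, next_feat: np.ndarray) -> float:
    """Cosine distance between two flattened BEV feature maps.

    Assumption: cosine distance stands in for the paper's shift measure.
    """
    a, b = prev_feat.ravel(), next_feat.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Treat an all-zero feature map as "no measurable shift".
        return 0.0
    return float(1.0 - np.dot(a, b) / denom)


def select_key_frames(scene_feats, budget, shift_threshold=0.1):
    """Return up to `budget` frame indices with the largest BEV shifts.

    Hypothetical helper: a frame is a candidate only if its shift from the
    previous frame exceeds `shift_threshold` (an assumed value).
    """
    shifts = [
        (bev_shift(scene_feats[i - 1], scene_feats[i]), i)
        for i in range(1, len(scene_feats))
    ]
    candidates = [(s, i) for s, i in shifts if s > shift_threshold]
    candidates.sort(reverse=True)  # largest shift first
    return sorted(i for _, i in candidates[:budget])
```

Calling `select_key_frames(feats, budget=k)` on a list of per-frame BEV feature arrays returns the chosen frame indices in temporal order. In an active learning loop, these frames would be sent for labeling before the model is retrained.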
