Active Learning from Scene Embeddings for End-to-End Autonomous Driving

Wenhao Jiang, Duo Li, Menghan Hu, Chao Ma, Ke Wang, Zhipeng Zhang

TL;DR

SEAD tackles labeling bottlenecks in end-to-end autonomous driving by introducing a BEV-feature–driven active learning framework. It builds a diverse initial dataset from static and dynamic scene cues, then iteratively selects valuable scenes and consecutive key frames using BEV feature shifts and predefined thresholds, all within a formal objective $G = \max (P/B)$. On nuScenes with a lightweight VAD-Tiny backbone, SEAD comes close to full-dataset planning performance using only a 30% labeling budget, outperforming several baselines in open-loop planning metrics. Ablation studies and visualizations support the effectiveness and robustness of each module, highlighting the practical data-efficiency gains of BEV-centric data valuation for E2E-AD across models and datasets.

Abstract

In autonomous driving, end-to-end deep learning models show great potential by learning driving decisions directly from sensor data. However, training these models requires large amounts of labeled data, which are time-consuming and expensive to obtain. Real-world driving data follows a long-tailed distribution in which simple scenarios make up the majority, which motivates us to identify the most challenging scenarios within it. Training on the selected highest-value data can then improve model performance efficiently. Prior research has selected valuable data using empirically designed strategies, but such manually designed methods generalize poorly to new data distributions. Observing that the BEV (Bird's Eye View) features in end-to-end models contain all the information required to represent the scenario, we propose SEAD, an active learning framework that relies on these vectorized scene-level features. The framework selects initial data based on driving-environment information and incremental data based on BEV features. Experiments show that only 30% of the nuScenes training data is needed to achieve performance close to that of the full dataset. The source code will be released.

Paper Structure

This paper contains 21 sections, 4 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overall pipeline of SEAD. Following the active learning setup, we first build an initial training set based on the initial strategy to train the model. Subsequently, we leverage the trained model and incremental selection strategy to actively select new data, iterating this process continuously. In this context, $D/S$ represents dynamic/static.
  • Figure 2: Initial selection. Constructing a diverse initial dataset by leveraging the rich information within the AD dataset.
  • Figure 3: Calculate $FS$ for scenes. Convert the computation of BEV $FS$ to focus on the shifts of key elements, specifically the agent and map features. Then, accumulate the frame-to-frame $FS$ results to determine the $FS$ for the scene.
  • Figure 4: Visualization of Selected Scenes and Frames. The front camera view is used to display the selected data. The $Scene$ represents a chosen scenario, with ✓ indicating the extracted key frames. BEV representation is provided to visualize the scenario value more intuitively.
  • Figure 5: Calculation of $FS$. We evaluated the impact of different $FS$ calculation methods on the results. Specifically, $BEV$ represents $FS$ calculated directly from BEV features, while $A\&M$ corresponds to calculations based on agent and map features.
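
The scene-level $FS$ described in Figures 3 and 5 (frame-to-frame shifts of agent and map features, accumulated over a scene) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine distance, the unweighted sum of agent and map shifts, and the names `cosine_distance`, `scene_feature_shift`, and `rank_scenes` are all assumptions made for illustration, and each frame is assumed to be summarized by one flat agent vector and one flat map vector.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (assumed metric; 0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def scene_feature_shift(agent_feats: List[Vector], map_feats: List[Vector]) -> float:
    """Accumulate frame-to-frame shifts of agent and map features over one scene.

    The equal weighting of agent and map shifts is a hypothetical choice.
    """
    if len(agent_feats) != len(map_feats):
        raise ValueError("agent and map features must cover the same frames")
    total = 0.0
    for t in range(1, len(agent_feats)):
        total += cosine_distance(agent_feats[t - 1], agent_feats[t])
        total += cosine_distance(map_feats[t - 1], map_feats[t])
    return total


def rank_scenes(scenes: Dict[str, Tuple[List[Vector], List[Vector]]]) -> List[str]:
    """Order scene names from largest to smallest accumulated shift."""
    return sorted(
        scenes,
        key=lambda name: scene_feature_shift(*scenes[name]),
        reverse=True,
    )


# Toy data: one scene whose features never change, one whose agent features flip.
toy_scenes = {
    "static": ([[1.0, 0.0]] * 3, [[1.0, 0.0]] * 3),
    "dynamic": ([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0]] * 3),
}
ranking = rank_scenes(toy_scenes)
```

In this toy setup the scene whose features change between frames ranks ahead of the unchanging one, which mirrors the idea that scenes with larger accumulated shifts are the more valuable candidates for labeling.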