Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM

Jack Qin, Zhitao Wang, Yinan Zheng, Keyu Chen, Yang Zhou, Yuanxin Zhong, Siyuan Cheng

TL;DR

This work tackles the generalization gap of end-to-end autonomous driving in long-tail, unseen scenarios. It presents Risk Semantic Distillation (RSD), a plug-in framework that distills risk-aware perception from Vision-Language Models into BEV-based E2E backbones via a RiskHead. RSD uses cross-view PV-to-BEV projection, deformable attention, and nearest-neighbor matching to align VLM-derived risk semantics with BEV features. It is trained with an $L_1$ loss between predicted and ground-truth risk, $L_{risk} = \| R_{pred} - R_{gt} \|_1$, and uses a risk consistency metric, $Diff\_Risk$, to supervise risk annotation. Empirical results on Bench2Drive show improved perception, planning, and closed-loop safety with a lightweight model (approximately 50M parameters), enabling real-time deployment without finetuning large VLMs.
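The $L_1$ risk loss above can be illustrated with a minimal sketch. It assumes that $R_{pred}$ and $R_{gt}$ are equally shaped 2D risk maps and that the norm is a plain sum of absolute differences. This section does not say whether the authors sum or average, and the function name `l1_risk_loss` and the toy 2x2 grid are hypothetical.

```python
def l1_risk_loss(r_pred, r_gt):
    """L1 norm of (r_pred - r_gt) for two equally shaped 2D risk maps.

    Assumption: the loss is an unnormalized sum, following the formula
    L_risk = ||R_pred - R_gt||_1; the source may normalize differently.
    """
    same_rows = len(r_pred) == len(r_gt)
    if not same_rows or any(len(p) != len(g) for p, g in zip(r_pred, r_gt)):
        raise ValueError("risk maps must have the same shape")
    return sum(
        abs(p - g)
        for row_p, row_g in zip(r_pred, r_gt)
        for p, g in zip(row_p, row_g)
    )


# Toy 2x2 risk maps (hypothetical values in [0, 1]).
r_gt = [[0.0, 1.0], [0.0, 0.5]]
r_pred = [[0.25, 0.75], [0.0, 0.5]]
loss = l1_risk_loss(r_pred, r_gt)  # |0.25 - 0| + |0.75 - 1| = 0.5
```

Cells where the prediction already matches the target contribute nothing, so the gradient signal concentrates on the cells whose risk is mispredicted.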

Abstract

Autonomous driving (AD) systems have exhibited remarkable performance in complex driving scenarios. However, generalization, the ability to handle unseen scenarios or unfamiliar sensor configurations, remains a key limitation of current systems. Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions directly from the VLM. However, these end-to-end solutions have prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential for navigating complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.

Paper Structure

This paper contains 26 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The framework of Risk Semantic Distillation. The diagram illustrates the integration of an End-to-End (E2E) architecture into production vehicles, utilizing the Vision-Language Model (VLM) for risk object identification and subsequent distillation. The VLM processes visual and textual information to identify and categorize critical objects, which are then used to enhance the E2E framework through distillation. This process improves the vehicle's ability to recognize and prioritize risk objects, enabling better risk attention capabilities. As a result, the system enhances the safety of autonomous driving by helping the vehicle detect and respond to potential risks more effectively.
  • Figure 2: VLM-enhanced Risk Semantic Annotation. A sequence of images is processed through key object description, risk semantic Chain-of-Thought (CoT), and a semantic mask drawer. The sequence begins with the extraction of critical objects (such as cars), followed by generating a risk score and reasoning using CoT to assess proximity and risk levels. The semantic mask drawer is then used to visualize the detected objects and their associated risk annotations.
  • Figure 3: The framework of Risk Semantic Distillation (RSD). This figure illustrates the process of projecting the Perspective View (PV) query into the BEV space, followed by the BEV re-batching procedure and nearest neighbor matching. It showcases the integration of point sampling, BEV masking, and the alignment of the reference camera with the 3D lidar-to-camera matrix. The system utilizes deformable attention for optimizing the risk semantic information extraction, with sampling offsets, attention weights, and locations playing crucial roles in enhancing the precision of risk object prediction. The final output is used for accurate decision-making in autonomous driving systems.
  • Figure 4: Risk Head Reconstruction. The reconstructed images show how different levels of risk are highlighted from the BEV features, guiding the learning of more effective BEV representations. This process helps the system focus on critical objects and scenarios, improving the vehicle's ability to navigate and make safe decisions in real-world environments.
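The projection and nearest-neighbor matching steps named in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes a single 4x4 lidar-to-image matrix, treats BEV cells as 3D center points in the lidar frame, and matches one VLM-annotated pixel to the closest visible cell. The names `project_to_image` and `nearest_bev_cell`, the toy matrix, and the cell coordinates are all hypothetical.

```python
import math


def project_to_image(point_xyz, lidar2img):
    """Project a lidar-frame 3D point to pixel (u, v) with a 4x4 matrix.

    Returns None when the point lies behind the camera.
    """
    homogeneous = (*point_xyz, 1.0)
    h = [sum(m * v for m, v in zip(row, homogeneous)) for row in lidar2img]
    depth = h[2]
    if depth <= 1e-5:
        return None
    return (h[0] / depth, h[1] / depth)


def nearest_bev_cell(pixel_uv, bev_cells, lidar2img, image_size):
    """Index of the visible BEV cell whose projection is closest to pixel_uv."""
    width, height = image_size
    best_index, best_dist = None, math.inf
    for index, center in enumerate(bev_cells):
        uv = project_to_image(center, lidar2img)
        # Skip cells this camera cannot see (a simple stand-in for BEV masking).
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            continue
        dist = math.dist(uv, pixel_uv)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Toy camera looking along lidar +x: focal length 100, principal point (50, 50).
lidar2img = [
    [50.0, -100.0, 0.0, 0.0],
    [50.0, 0.0, -100.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bev_cells = [
    (10.0, 0.0, 0.0),   # projects to (50, 50)
    (10.0, 2.0, 0.0),   # projects to (30, 50)
    (10.0, -2.0, 0.0),  # projects to (70, 50)
    (-10.0, 0.0, 0.0),  # behind the camera, masked out
    (10.0, 8.0, 0.0),   # projects outside the 100x100 image, masked out
]
matched = nearest_bev_cell((68.0, 50.0), bev_cells, lidar2img, (100, 100))
```

Here `matched` is the index of the cell at `(10.0, -2.0, 0.0)`, whose projection at `(70, 50)` is two pixels from the query. A real pipeline would batch this over all cameras and cells and feed the matched locations to deformable attention, which this sketch omits.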