Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM
Jack Qin, Zhitao Wang, Yinan Zheng, Keyu Chen, Yang Zhou, Yuanxin Zhong, Siyuan Cheng
TL;DR
This work tackles the generalization gap of end-to-end autonomous driving in long-tail, unseen scenarios. It presents Risk Semantic Distillation (RSD), a plug-in framework that distills risk-aware perception from Vision-Language Models into BEV-based E2E backbones via a RiskHead. RSD leverages cross-view PV-to-BEV projection, deformable attention, and nearest-neighbor matching to align VLM-derived risk semantics with BEV features. It uses an $L_1$ loss between predicted and ground-truth risk, $L_{risk} = \| R_{pred} - R_{gt} \|_1$, together with a risk consistency metric $Diff\_Risk$ to supervise risk annotation. Empirical results on Bench2Drive show improved perception, planning, and closed-loop safety with a lightweight model (approximately 50M parameters), enabling real-time deployment without finetuning large VLMs.
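As a rough illustration of the loss $L_{risk} = \| R_{pred} - R_{gt} \|_1$ quoted above, the following minimal sketch computes an entrywise $L_1$ distance between two risk maps with NumPy. The function name `risk_l1_loss`, the 2x3 grid, and the risk values are hypothetical and chosen only for illustration. The summary does not state whether the loss is summed or averaged over BEV cells, so the plain sum used here is an assumption.

```python
import numpy as np


def risk_l1_loss(r_pred, r_gt):
    """Entrywise L1 norm ||r_pred - r_gt||_1 between two risk maps.

    Assumption: the loss is a plain sum of absolute differences over all
    cells; the source does not specify the reduction.
    """
    r_pred = np.asarray(r_pred, dtype=np.float64)
    r_gt = np.asarray(r_gt, dtype=np.float64)
    if r_pred.shape != r_gt.shape:
        raise ValueError("risk maps must have the same shape")
    return float(np.abs(r_pred - r_gt).sum())


# Hypothetical 2x3 BEV risk maps with values in [0, 1].
r_gt = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.5, 0.0]])
r_pred = np.array([[0.25, 0.75, 0.0],
                   [0.0, 0.5, 0.5]])

# Absolute differences are 0.25, 0.25 and 0.5 in three cells, 0 elsewhere.
loss = risk_l1_loss(r_pred, r_gt)
print(loss)
```

In a training setup the same quantity would typically be computed on framework tensors so that gradients can flow into the RiskHead; the NumPy version only shows the arithmetic.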
Abstract
Autonomous driving (AD) systems have exhibited remarkable performance in complex driving scenarios. However, generalization, the ability to handle unseen scenarios or unfamiliar sensor configurations, remains a key limitation of current systems. Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions directly from a VLM. However, these end-to-end solutions impose prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from VLMs into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk-attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential for navigating complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Owing to the enhanced BEV representations enabled by RSD, we observe a significant improvement in both perception and planning capabilities.
