VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion

Pei Liu, Haipeng Liu, Haichao Liu, Xin Liu, Jinxin Ni, Jun Ma

TL;DR

This work tackles a gap in end-to-end autonomous driving: driver attentional semantics are not explicitly modeled. It introduces VLM-E2E, which fuses textual cues derived from a Vision-Language Model (VLM) with BEV features through learnable BEV-Text fusion and semantic refinement, enabling richer perception, prediction, and planning. A VLM-based text annotation pipeline (BLIP-2) with CLIP-based text embeddings, a text interaction module, and a spatiotemporal BEV representation together drive attention-guided planning. Experiments on nuScenes demonstrate improvements across perception, 2 s prediction, and trajectory planning, with ablations confirming CLIP's effectiveness and identifying front-view text as the most informative modality.

Abstract

Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space. This loss hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the issue of modality importance imbalance in fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from visual and textual modalities is effectively utilized. By explicitly addressing the imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showcasing the effectiveness of our attention-enhanced BEV representation in enabling more accurate and reliable autonomous driving tasks.
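
The abstract does not give the exact form of the BEV-Text learnable weighted fusion, so the following is only a minimal sketch of one plausible reading: two learnable scalar logits are normalized with a softmax and used to blend a BEV feature map with a projected text embedding. The function name `fuse_bev_text`, the projection matrix `proj`, the tensor shapes, and the softmax normalization are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def fuse_bev_text(bev, text_emb, proj, logits):
    """Hypothetical learnable weighted fusion (illustrative only).

    bev:      (C, H, W) BEV feature map
    text_emb: (D,) text embedding, e.g. from a CLIP-style text encoder
    proj:     (C, D) assumed linear projection from text space to BEV channels
    logits:   (2,) learnable scalars; softmax keeps the two weights summing to 1
    """
    w_bev, w_text = softmax(logits)
    text_feat = proj @ text_emb            # (C,)
    text_map = text_feat[:, None, None]    # (C, 1, 1), broadcast over H and W
    return w_bev * bev + w_text * text_map

# Toy inputs with assumed sizes; real dimensions depend on the model.
rng = np.random.default_rng(0)
bev = rng.standard_normal((64, 200, 200))
text_emb = rng.standard_normal(512)
proj = rng.standard_normal((64, 512)) * 0.01
logits = np.zeros(2)                       # equal weighting at initialization

fused = fuse_bev_text(bev, text_emb, proj, logits)
print(fused.shape)  # (64, 200, 200)
```

In a trainable version, `logits` and `proj` would be parameters updated by gradient descent, so the balance between the two modalities is learned rather than fixed by hand.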

Paper Structure

This paper contains 18 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: VLM-E2E augments the end-to-end driving model with semantic textual descriptions during training. These descriptions extract driver attention from VLMs to encourage the model to learn richer attentional semantics.
  • Figure 2: We present VLM-E2E, a driver attention enhanced end-to-end vision-based framework. VLM-E2E consists of three modules: VLM-based Text Annotation Generation, Text Interaction Guidance Module, and Vision-based End-to-end Model.
  • Figure 3: Qualitative analysis on prediction. (a) shows the multi-view input images. (b) shows the heatmap (blue to red), which illustrates the probability distribution of instance centers within the scene, with warmer colors indicating higher-confidence regions. (c) shows the vehicle segmentation, which effectively distinguishes individual instances in the complex traffic scenario. (d) shows, for each pixel, the directional vector pointing towards the corresponding instance center, demonstrating the model’s understanding of spatial relationships. (e) exhibits consistency within each instance, reflecting the characteristic rigid-body motion of vehicles.
  • Figure 4: Qualitative analysis on planning. (a) shows the multi-view input images. (b) shows the planned trajectory (blue). (d) presents the learned costmap, where warmer colors indicate lower cost.