Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets

Alex Koran, Dimitrios Sinodinos, Hadi Hojjati, Takuya Nanri, Fangge Chen, Narges Armanfard

Abstract

High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
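The MIL formulation is only named above; as a rough illustration of how such an objective can yield temporally localized collision scores from clip-level labels alone, the following is a minimal sketch assuming a common top-k pooling instantiation (the function and argument names are illustrative, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def mil_clip_loss(frame_scores: torch.Tensor,
                  clip_labels: torch.Tensor,
                  topk_ratio: float = 0.1) -> torch.Tensor:
    """Hypothetical MIL objective: pool per-frame collision logits into a
    clip-level score and supervise with clip-level labels only, so that
    frame-level localization emerges without frame-level annotation.

    frame_scores: (B, T) per-frame logits from the detector head.
    clip_labels:  (B,) 1 if the clip contains a collision, else 0.
    """
    k = max(1, int(frame_scores.size(1) * topk_ratio))
    # Average the k highest-scoring frames: a positive clip needs only a few
    # high-risk frames, while a negative clip is suppressed everywhere.
    clip_scores = frame_scores.topk(k, dim=1).values.mean(dim=1)
    return F.binary_cross_entropy_with_logits(clip_scores, clip_labels.float())
```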

Paper Structure

This paper contains 40 sections, 8 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Infraction rates of TF++ [jaeger2023hidden] on the official CARLA Leaderboard training (Town 12) and validation (Town 13) routes. Collisions dominate total infractions, especially in validation.
  • Figure 2: Architecture overview. VLAAD Training: Given multimodal video-caption pairs, the XCLIP backbone remains frozen while the Adapter and Detector are jointly optimized with the cosine embedding loss, $\mathcal{L}_{cos}$, and the binary cross-entropy loss, $\mathcal{L}_{bce}$. Driving Model Fine-Tuning & Inference: VLAAD operates as a frozen module, and only its video encoder is used. Its collision risk score is concatenated with high-level states (velocity, navigation commands) and passed to the global encoder to condition the transformer decoder's waypoint and speed predictions (a sketch of this conditioning step follows the figure list). At timestep $t$, the risk token is computed causally from the past buffer $[t-K,t]$ only (no future frames).
  • Figure 3: Overview of the CARLA-Collide data generation pipeline. Collision Clips: We record 40-frame (10 s) segments from an online sensor agent (TF++), placing the collision event between 2.5 s and 7.5 s into the clip (the windowing arithmetic is sketched after the figure list). Corresponding infraction logs are summarized by an LLM to produce event-specific captions. Normal Clips: Leveraging offline expert data from SimLingo, we perform a two-stage LLM summarization, first at the frame level (incorporating VQA and scenario data) and then at the clip level, to generate cohesive driving descriptions.
  • Figure 4: Temporal collision-risk predictions of VLAAD with and without MIL. (a) Real-Collide clip where a pickup truck cuts in and causes a collision (evaluated on the 20% held-out split). (b) CARLA-Collide validation clip in rain involving a minor collision and an open-door hazard (model trained on MMAU and BDDX).
  • Figure 5: Three normal driving clips from Real-Collide. VLAAD with MIL stays near zero across all videos, correctly identifying normal behavior, while the model w/o MIL shows persistent nonzero responses. Examples are from the held-out 20% test split.
  • ...and 6 more figures
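To make the conditioning step in the Figure 2 caption concrete, here is a minimal sketch assuming a frozen VLAAD scorer that maps a buffer of past frames to a scalar risk; the class name, dimensions, and the linear global encoder are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class RiskConditionedEncoder(nn.Module):
    """Illustrative sketch of the Figure 2 conditioning step (all names and
    dimensions here are assumptions)."""

    def __init__(self, vlaad: nn.Module, state_dim: int, hidden_dim: int):
        super().__init__()
        self.vlaad = vlaad.eval()                # frozen collision scorer
        for p in self.vlaad.parameters():
            p.requires_grad_(False)
        # Global encoder over [high-level states ; scalar risk].
        self.global_encoder = nn.Linear(state_dim + 1, hidden_dim)

    def forward(self, frame_buffer: torch.Tensor, state: torch.Tensor):
        # frame_buffer holds frames from [t-K, t] only, so the risk token is
        # causal at timestep t (no future frames are seen).
        with torch.no_grad():
            risk = self.vlaad(frame_buffer)      # (B, 1) collision risk
        # Concatenate risk with velocity / navigation-command states, then
        # encode; the result conditions the waypoint/speed decoder.
        return self.global_encoder(torch.cat([state, risk], dim=-1))
```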
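Similarly, the clip windowing in the Figure 3 caption (40 frames over 10 s, i.e. 4 fps, with the collision falling between 2.5 s and 7.5 s into the clip) can be sketched as follows; the random placement within that band is an assumption:

```python
import random

def collision_window(collision_frame: int,
                     fps: float = 4.0,
                     clip_len: int = 40) -> tuple[int, int]:
    """Hypothetical reconstruction of the CARLA-Collide windowing: 40 frames
    at 4 fps span 10 s, and the clip start is chosen so the collision falls
    between 2.5 s and 7.5 s into the clip (exact placement is an assumption).
    """
    offset_s = random.uniform(2.5, 7.5)          # collision time within clip
    start = collision_frame - round(offset_s * fps)
    return start, start + clip_len               # [start, start + 40) frames
```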