Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, Feng Guo

TL;DR

The study addresses the gap in evaluating Vision Large Language Models (VLLMs) for safety-critical driving by introducing DVBench, a benchmark built on a hierarchical ability taxonomy with 25 granular abilities and a bank of 10,000 multiple-choice questions derived from safety-critical driving videos. It couples automated annotation with an evaluation framework that includes GroupEval, a strategy for mitigating position bias, and benchmarks 14 SOTA VLLMs in zero-shot settings. The results reveal significant gaps, especially in reasoning: no model exceeds 40% accuracy. The work also demonstrates the value of domain knowledge and targeted fine-tuning, which yields gains of up to 10.94 percentage points and highlights the need for domain adaptation in real-world autonomous driving systems. Together, these pieces make DVBench a structured, open framework for advancing VLLMs toward the safety and robustness required for mission-critical driving tasks.

Abstract

Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions, failing to adequately assess VLLMs' performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs' capabilities in perception and reasoning. Experiments on 14 SOTA VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose VLLMs and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We released the benchmark toolbox and the fine-tuned model at: https://github.com/tong-zeng/DVBench.git.

Paper Structure

This paper contains 21 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Performance of nine representative Vision Large Language Models, ranging from 0.5B to 72B parameters, evaluated across the 25 driving video comprehension abilities defined in DVBench.
  • Figure 2: The DVBench Ability Hierarchy is structured across three distinct levels of capability dimensions: 2 foundational L1 abilities (perception and reasoning), 10 specialized L2 abilities for key cognitive tasks, and 25 granular L3 abilities that capture specific assessment criteria.
  • Figure 3: Examples of L2 Ability Taxonomy questions: perception and reasoning. Perception abilities focus on low-level skills such as object recognition, environmental condition parsing, and infrastructure comprehension, while reasoning abilities encompass higher-level cognitive capacities like inference, prediction, and causal analysis.
  • Figure 4: GroupEval Strategy. In GroupEval, each question is tested multiple times, with the correct answer's position changing each time while the other options are shuffled randomly. Here, the IndividualEval strategy deems the VLLM successful, whereas GroupEval considers it unsuccessful, as the VLLM fails to consistently identify the correct answer across different trials.
  • Figure 5: Distribution of Ground-Truth Answers and Sample VLLM Predictions: In DVBench, there exist some questions that offer only 2 or 3 answer choices, leading to a slightly uneven distribution of ground-truth answers.
  • ...and 2 more figures
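
The GroupEval strategy described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the DVBench toolbox's implementation: the `Model` callable interface, the function names, the choice of one trial per answer position, and the toy question are all hypothetical and were chosen only to make the idea concrete. The caption itself says only that each question is tested multiple times with the correct answer's position changing and the other options shuffled.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical model interface (an assumption for this sketch): given a question
# and the ordered answer options, return the index of the selected option.
Model = Callable[[str, Sequence[str]], int]


def make_trials(options: Sequence[str], correct: str, rng: random.Random) -> List[List[str]]:
    """Build one trial per answer position (assumed trial count).

    In trial i the correct answer sits at index i; the remaining options are
    shuffled randomly around it. Options are assumed to be unique strings.
    """
    distractors = [o for o in options if o != correct]
    trials = []
    for pos in range(len(options)):
        shuffled = distractors[:]
        rng.shuffle(shuffled)
        shuffled.insert(pos, correct)
        trials.append(shuffled)
    return trials


def individual_eval(model: Model, question: str, options: Sequence[str], correct: str) -> bool:
    """Single pass: the question is asked once, in its original option order."""
    return options[model(question, options)] == correct


def group_eval(model: Model, question: str, options: Sequence[str], correct: str, seed: int = 0) -> bool:
    """The question counts as solved only if every trial is answered correctly."""
    rng = random.Random(seed)
    return all(
        trial[model(question, trial)] == correct
        for trial in make_trials(options, correct, rng)
    )


# Toy models for illustration only.
def always_first(question: str, options: Sequence[str]) -> int:
    """A position-biased model that always picks the first option."""
    return 0


def oracle(question: str, options: Sequence[str]) -> int:
    """A model that always finds the correct option, wherever it is placed."""
    return list(options).index("braking lead vehicle")


# Made-up example question, not taken from DVBench.
QUESTION = "What caused the near-crash event?"
OPTIONS = ["braking lead vehicle", "wet road surface", "pedestrian crossing", "lane change"]
CORRECT = "braking lead vehicle"

if __name__ == "__main__":
    print(individual_eval(always_first, QUESTION, OPTIONS, CORRECT))  # True
    print(group_eval(always_first, QUESTION, OPTIONS, CORRECT))       # False
    print(group_eval(oracle, QUESTION, OPTIONS, CORRECT))             # True
```

The position-biased toy model passes the single-pass check only because the correct answer happens to be listed first, and it fails as soon as the answer moves. That is the situation the caption describes, where IndividualEval deems the VLLM successful and GroupEval does not.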