Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Yue Li, Meng Tian, Zhenyu Lin, Jiangtong Zhu, Dechang Zhu, Haiqiang Liu, Zining Wang, Yueyi Zhang, Zhiwei Xiong, Xinhai Zhao

TL;DR

VLADBench introduces a fine-grained, multi-domain benchmark for evaluating Vision-Language Models in autonomous driving, addressing gaps in coarse-grained VLM assessments. It defines 5 domains, 11 secondary aspects, and 29 tertiary tasks across 2,000 static and 3,000 dynamic scenes, supplemented by 1.4M domain-specific QAs for training. Comprehensive experiments show current VLMs achieve sub-60% accuracy, reveal cross-domain synergy effects, and demonstrate that vision-encoder capacity can outweigh language-model scale for AD tasks. The work highlights meaningful gaps and provides a route toward cognitively sophisticated AD systems through fine-grained evaluation and domain-aware training.

Abstract

Existing benchmarks for Vision-Language Models (VLMs) in autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remains insufficient for assessing capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained dataset featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.

Paper Structure

This paper contains 23 sections, 31 figures, 9 tables.

Figures (31)

  • Figure 1: A sunburst chart of VLADBench categories. The proposed dataset spans 5 key domains, 11 secondary aspects, and 29 tertiary tasks, including about 2,000 static scenes and 3,000 dynamic scenarios, comprising 12,000 closed-form questions.
  • Figure 2: Real-world examples of the tasks in (a) Traffic Knowledge Understanding, (b) General Element Recognition, and (c, d) Traffic Graph Generation domains. 'Rec.' and 'RL' denote recognition and relation.
  • Figure 3: Examples in intention judgment and ego action reasoning aspects. ST.RL.: spatio-temporal reasoning, K.O.D.: key object detection.
  • Figure 4: Gain chart of the five key domains. This chart shows the performance improvements of models trained on datasets categorized by the five key domains, evaluated on ADBench, compared to the base model.
  • Figure 5: Examples of Pavement Marking.
  • ...and 26 more figures