AutoDrive-QA: A Multiple-Choice Benchmark for Vision-Language Evaluation in Urban Autonomous Driving

Boshra Khalili, Andrew W. Smyth

TL;DR

AutoDrive-QA tackles the inconsistent evaluation of vision-language models in autonomous urban driving by converting open-ended QA from DriveLM, NuScenes-QA, and LingoQA into structured multiple-choice questions with domain-grounded distractors. The authors design an end-to-end pipeline (dataset conversion, distractor generation, filtering) and a multi-stage quality-control process to ensure a single correct answer and challenging options across perception, prediction, and planning. Empirical results show that task-specific fine-tuning of LLaVA-1.5-7B yields gains of about six percentage points, while GPT-4V achieves the strongest zero-shot performance; traditional metrics such as BLEU and CIDEr, however, fail to distinguish strong from weak models. The work provides a reproducible, interpretable benchmarking framework that highlights common driving errors and offers a foundation for scalable, domain-aware evaluation of urban AI systems.

Abstract

Evaluating vision-language models (VLMs) in urban driving contexts remains challenging, as existing benchmarks rely on open-ended responses that are ambiguous, annotation-intensive, and inconsistent to score. This lack of standardized evaluation slows progress toward safe and reliable AI for urban mobility. We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA datasets (DriveLM, NuScenes-QA, LingoQA) into structured multiple-choice questions (MCQs) with distractors grounded in five realistic error categories: Driving Domain Misconceptions, Logical Inconsistencies, Misinterpreted Sensor Inputs, Computational Oversights, and Question Ambiguity. This framework enables reproducible and interpretable evaluation of VLMs across perception, prediction, and planning tasks in complex urban scenes. Experiments show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively, particularly in multi-view settings. Moreover, traditional metrics such as BLEU and CIDEr fail to distinguish strong from weak models. By providing an objective, domain-grounded evaluation protocol, AutoDrive-QA contributes to more transparent benchmarking of urban AI systems, supporting the development of safer and more trustworthy autonomous driving technologies for smart cities.
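
To make the multiple-choice protocol concrete, the following is a minimal sketch of how one converted item and its scoring might be represented. It is not the authors' schema or code: the class and field names (`MCQItem`, `distractor_categories`, `accuracy_by_task`, `error_profile`) are hypothetical, and the two example questions are invented for illustration. Only the five error-category names and the three task names come from the abstract.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class ErrorCategory(Enum):
    """The five distractor error categories named in the abstract."""
    DOMAIN_MISCONCEPTION = "Driving Domain Misconceptions"
    LOGICAL_INCONSISTENCY = "Logical Inconsistencies"
    MISINTERPRETED_SENSOR_INPUT = "Misinterpreted Sensor Inputs"
    COMPUTATIONAL_OVERSIGHT = "Computational Oversights"
    QUESTION_AMBIGUITY = "Question Ambiguity"


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice question."""
    task: str                       # "perception", "prediction", or "planning"
    question: str
    options: Tuple[str, ...]
    answer_index: int               # index of the single correct option
    # Maps each distractor's option index to the error category it targets.
    distractor_categories: Dict[int, ErrorCategory]


def accuracy_by_task(items: List[MCQItem], predictions: List[int]) -> Dict[str, float]:
    """Fraction of correctly answered items, grouped by task."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.task] += 1
        correct[item.task] += int(pred == item.answer_index)
    return {task: correct[task] / total[task] for task in total}


def error_profile(items: List[MCQItem], predictions: List[int]) -> Counter:
    """Count which error category each wrongly chosen distractor belongs to."""
    counts: Counter = Counter()
    for item, pred in zip(items, predictions):
        if pred != item.answer_index and pred in item.distractor_categories:
            counts[item.distractor_categories[pred]] += 1
    return counts


# Invented toy items, not taken from DriveLM, NuScenes-QA, or LingoQA.
items = [
    MCQItem(
        task="perception",
        question="What color is the traffic light ahead?",
        options=("Red", "Green", "Yellow"),
        answer_index=0,
        distractor_categories={
            1: ErrorCategory.MISINTERPRETED_SENSOR_INPUT,
            2: ErrorCategory.MISINTERPRETED_SENSOR_INPUT,
        },
    ),
    MCQItem(
        task="planning",
        question="What should the ego vehicle do at the stop sign?",
        options=("Come to a full stop", "Slow down and roll through", "Accelerate"),
        answer_index=0,
        distractor_categories={
            1: ErrorCategory.DOMAIN_MISCONCEPTION,
            2: ErrorCategory.LOGICAL_INCONSISTENCY,
        },
    ),
]
predictions = [0, 1]  # hypothetical model choices, one option index per item

print(accuracy_by_task(items, predictions))
print(error_profile(items, predictions))
```

Because every item has exactly one correct option, scoring reduces to an index comparison, which is what makes the evaluation reproducible. Tagging each distractor with its category is what would allow wrong answers to be read as a profile of error types rather than a single accuracy number.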

Paper Structure

This paper contains 17 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The AutoDrive-QA pipeline for automated MCQ generation in autonomous driving.
  • Figure 2: Paraphrased prompt for generating distractors targeting visual interpretation errors, including driving-specific cases (misinterpreted sensor inputs).
  • Figure 3: Reviewer prompt for analyzing and refining distractors in autonomous driving scenarios, ensuring they remain plausible yet incorrect while improving difficulty.
  • Figure 4: Unified refinement prompt for improving distractors in autonomous driving multiple-choice questions.
  • Figure 5: Evaluator prompt for checking correctness of driving-scene distractors, ensuring only one valid answer is possible.
  • ...and 1 more figure