Bench2Drive-VL: Benchmarks for Closed-Loop Autonomous Driving with Vision-Language Models

Xiaosong Jia, Yuqian Shao, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, Junchi Yan

Abstract

With the rise of vision-language models (VLMs), their application to autonomous driving (VLM4AD) has gained significant attention. Meanwhile, in autonomous driving, closed-loop evaluation has become widely recognized as a more reliable validation method than open-loop evaluation, since it measures model performance under cumulative errors and out-of-distribution inputs. However, existing VLM4AD benchmarks evaluate a model's scene-understanding ability only in an open-loop setting, i.e., via static question-answer (QA) datasets. This kind of evaluation fails to assess VLM performance in out-of-distribution states that rarely appear in human-collected datasets. To this end, we present Bench2Drive-VL, an extension of Bench2Drive that brings closed-loop evaluation to VLM-based driving. It introduces: (1) DriveCommenter, a closed-loop generator that automatically produces diverse, behavior-grounded question-answer pairs for all driving situations in CARLA, including severe off-route and off-road deviations previously unassessable in simulation; (2) a unified protocol and interface that allows modern VLMs to be plugged directly into the Bench2Drive closed-loop environment for comparison with traditional agents; (3) a flexible reasoning and control framework supporting multi-format visual inputs and configurable graph-based chain-of-thought execution; and (4) a complete development ecosystem. Together, these components form a comprehensive closed-loop benchmark for VLM4AD. All code and annotated datasets are open-sourced.

Paper Structure

This paper contains 19 sections, 18 figures, and 4 tables.

Figures (18)

  • Figure 1: Overview of Bench2Drive-VL. (i) The expert model DriveCommenter leverages privileged information to generate ground-truth answers for vision-language questions. (ii) The VLM under evaluation takes annotated sensor data as input, answers the same questions, and controls the ego vehicle. (iii) A VQA evaluator compares the VLM's responses to the ground truth provided by DriveCommenter, while the action module converts the VLM's natural language instructions into control signals for vehicle actuation.
  • Figure 2: Overview of the Bench2Drive-VL closed-loop evaluation framework. The expert model DriveCommenter leverages privileged information to generate ground-truth answers for vision-language questions. The VLM under evaluation takes sensor data as input, answers the same questions, and controls the ego vehicle. A VQA evaluator compares the VLM's responses to the ground truth provided by DriveCommenter, while the action module converts the VLM's natural language instructions into control signals.
  • Figure 3: Annotation Process of Bench2Drive-VL.
  • Figure 4: Visualization Tools of Bench2Drive-VL.
  • Figure 5: Graph-of-Thought of Bench2Drive-VL.
  • ...and 13 more figures