Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang

TL;DR

Reason2Drive introduces a large-scale, chain-based reasoning benchmark for autonomous driving and proposes ADRScore, a metric that evaluates step-wise reasoning rather than surface-level text similarity. The paper also presents a framework with a prior tokenizer and an instructed vision decoder that exploit object-level perceptual priors to improve spatial localization and reasoning in driving scenes. In the reported experiments, the approach outperforms baselines on reasoning metrics and generalizes better to unseen data, supporting interpretable reasoning for downstream planning. The authors state that the code and dataset will be released.

Abstract

Large vision-language models (VLMs) have garnered increasing interest in autonomous driving due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo, and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. The code and dataset will be released.
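
The abstract's point about the semantic ambiguity of n-gram metrics can be illustrated with a small sketch. The code below is not the paper's metric and not a full BLEU implementation: it computes only the clipped n-gram precision term that BLEU builds on, and the three driving sentences are made-up examples chosen for illustration. It suggests how an answer with the opposite meaning can score far higher than a correct paraphrase.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision: the overlap term BLEU is built on.

    Full BLEU additionally combines several n-gram orders and applies a
    brevity penalty; both are omitted here to keep the sketch short.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference (Counter returns 0 for missing n-grams).
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical sentences, not taken from the Reason2Drive dataset.
reference = "the vehicle should stop because a pedestrian is crossing"
wrong = "the vehicle should not stop because a pedestrian is crossing"
paraphrase = "brake now since someone is walking across the road"

for name, sentence in [("wrong", wrong), ("paraphrase", paraphrase)]:
    p1 = clipped_precision(sentence, reference, 1)
    p2 = clipped_precision(sentence, reference, 2)
    print(f"{name:10s} unigram={p1:.2f} bigram={p2:.2f}")
```

Under these assumptions, the answer that reverses the decision ("should not stop") keeps most of its n-gram overlap with the reference, while the semantically correct paraphrase shares almost none. This is the kind of mismatch that motivates evaluating the reasoning chain itself rather than surface overlap.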

Paper Structure (29 sections, 14 equations, 12 figures, 10 tables)

Figures (12)

  • Figure 1: (a) Different decision-making processes in autonomous driving. (b) Language-based dataset comparison.
  • Figure 2: Schema of Our Reason2Drive Dataset. The upper part illustrates the pipeline for the automated construction of datasets. The lower part shows detailed instances of perception, prediction, and reasoning, accompanied by outcomes after applying GPT-4 for data augmentation. The special tokens hold distinct definitions: <Inst*> represents a specified instance, <MOT> signifies a forecasted sequence of trajectory coordinates, and <LOC> denotes positional coordinates. The colors associated with these tokens correspond to the highlighted objects in the upper-left image's boxes.
  • Figure 3: The comparison between our Reason2Drive dataset and other prompt-based datasets. $\blacksquare$ means dataset not published.
  • Figure 4: Data quality comparison. Reason2Drive is larger in scale, richer in data content, and more diverse in scenarios.
  • Figure 5: Statistical distribution of different tasks in Reason2Drive.
  • ...and 7 more figures