DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan

TL;DR

DriveLMM-o1 presents a step-by-step visual reasoning benchmark for autonomous driving, addressing the gap where existing VQA datasets emphasize final answers over interpretable reasoning. It provides a multimodal dataset with multiview images and LiDAR grounded in NuScenes, plus driving-specific evaluation metrics. The authors train a large multimodal model (fine-tuned InternVL2.5-8B with LoRA) on this data and show substantial gains in both final answer accuracy and reasoning score against open-source baselines. The work advances interpretable perception-prediction-planning in driving and offers a standardized benchmark for evaluating reasoning in high-stakes multimodal scenarios.

Abstract

While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.

Paper Structure

This paper contains 13 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Examples from our proposed DriveLMM-o1 dataset. Our proposed dataset is designed to promote step-by-step reasoning in autonomous driving scenarios, guiding models from understanding the driving task and scene context to making logical inferences based on visual and spatial cues, ultimately leading to accurate decision-making and answer generation. The middle section shows multiview images for two driving scenarios. In the top row, two examples of multiple-choice questions are displayed, along with the step-by-step reasoning annotation required to reach an accurate conclusion. The bottom row includes questions referring to key objects in the scene, with each object highlighted in its corresponding color.
  • Figure 2: An overview of the benchmark development process. We build our dataset upon frames and objects from NuScenes [nuscenes2019]. Initial reasoning and answers for a standard question set are generated using an LMM, followed by correction and verification of each sample by human annotators.
  • Figure 3: Qualitative Results: We present qualitative examples comparing the reasoning process and final answers generated by the baseline InternVL2.5-8B model and our fine-tuned model against the ground truth. The results highlight the critical role of accurate reasoning in arriving at the correct final answer, as our model demonstrates greater scene awareness and contextual understanding, leading to more precise and reliable decisions.
  • Figure 4: Qualitative Comparison against LlamaV-o1: A qualitative comparison between our model's reasoning outputs and final answers and those of the recent visual reasoning model LlamaV-o1 [thawakar2025llamav]. While LlamaV-o1 performs well across multiple domains, it struggles with the domain-specific step-by-step logical reasoning required for complex autonomous driving scenarios.