MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu

TL;DR

MindDriver is a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving, proceeding through semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning.

Abstract

Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. However, Chain-of-Thought (CoT), the reasoning strategy most widely used with VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the semantic space of text and the physical space of trajectories. A recent approach replaces text with future images as the CoT process, but it lacks clear planning-oriented objective guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving. MindDriver proceeds through semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To align these reasoning stages, we develop a feedback-guided automatic data annotation pipeline that generates aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method that optimizes the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.

Paper Structure

This paper contains 19 sections, 8 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Comparison of different reasoning methods. Text reasoning struggles with space misalignment, while image reasoning suffers from guideless image prediction. Our proposed progressive multimodal reasoning conducts aligned smooth reasoning.
  • Figure 2: Overview. (Left) Framework of MindDriver, which conducts a perception-imagination-action process for accurate trajectory planning. (Top right) Reasoning data annotation pipeline: the progressive multimodal reasoning data is auto-annotated using rule-based and model-based filtering with feedback-guided regeneration. (Bottom right) Progressive reinforcement fine-tuning, applied to enhance the progressive reasoning process.
  • Figure 3: Auto-annotation pipeline for progressive multimodal reasoning training data. Qwen2.5-VL-72B first annotates raw CoT, which is then filtered based on format, decision, and logic. Failed cases are re-annotated using error feedback to improve generation quality.
  • Figure 4: Qualitative comparison of MindDriver with baselines. (Left) Three scenarios from the open-loop nuScenes benchmark. The red trajectory is the prediction and the green one is the ground truth (GT). (Right) Performance variation across timestamps on closed-loop Bench2Drive.
  • Figure 5: Prompt given to Qwen3-235B-A22B-Instruct for logical verification.
  • ...and 7 more figures
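
The Figure 3 caption describes an annotate, filter, and re-annotate loop: raw CoT is generated, checked on format, decision, and logic, and failed cases are regenerated using the error feedback. A minimal sketch of such a loop follows. It is not the authors' implementation: the function names, the tag format, the toy annotator, and the `max_rounds` limit are all hypothetical, and the model calls (Qwen2.5-VL-72B for annotation, a separate model for logic checks) are replaced by plain Python callables.

```python
from typing import Callable, List, Optional, Tuple

# A check returns None when the CoT passes, or an error message when it fails.
Check = Callable[[str], Optional[str]]


def annotate_with_feedback(
    sample: str,
    annotate: Callable[[str, List[str]], str],
    checks: List[Tuple[str, Check]],
    max_rounds: int = 3,  # hypothetical retry budget, not from the paper
) -> Optional[str]:
    """Return a CoT that passes every check, or None if the budget runs out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        # The annotator sees the errors from the previous round (empty at first).
        cot = annotate(sample, feedback)
        errors: List[str] = []
        for name, check in checks:
            message = check(cot)
            if message is not None:
                errors.append(f"{name}: {message}")
        if not errors:
            return cot
        feedback = errors
    return None


# Toy stand-ins for the annotator and two of the filters (hypothetical).
def toy_annotate(sample: str, feedback: List[str]) -> str:
    if not feedback:
        return "turn left"  # first attempt: badly formatted, wrong decision
    return "<think>lead car brakes</think><decision>slow down</decision>"


def format_check(cot: str) -> Optional[str]:
    return None if cot.startswith("<think>") else "missing <think> tag"


def decision_check(cot: str) -> Optional[str]:
    return None if "slow down" in cot else "decision disagrees with ground truth"


TOY_CHECKS = [("format", format_check), ("decision", decision_check)]
```

With the toy annotator, the first round fails both checks and the second round succeeds once feedback is supplied, so `annotate_with_feedback("scene-0", toy_annotate, TOY_CHECKS)` returns the tagged string, while `max_rounds=1` returns `None`.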