MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

Lingjun Zhang; Yujian Yuan; Changjie Wu; Xinyuan Chang; Xin Cai; Shuang Zeng; Linzhe Shi; Sijin Wang; Hang Zhang; Mu Xu

TL;DR

MindDriver is a progressive multimodal reasoning framework that enables a VLM to imitate human-like progressive thinking for autonomous driving, proceeding through semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning.

Abstract

Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. However, Chain-of-Thought (CoT), a widely used reasoning strategy for VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the semantic space of text and the physical space of trajectories. A recent approach replaces text with future images as the CoT process, but it lacks clear planning-oriented guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables a VLM to imitate human-like progressive thinking for autonomous driving. MindDriver proceeds through semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To align these reasoning stages, we develop a feedback-guided automatic data annotation pipeline that generates aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method that optimizes the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.
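
The feedback-guided annotation idea mentioned in the abstract (generate an annotation, filter it, and regenerate failed cases using the error feedback) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: `annotate_with_feedback`, `toy_annotate`, `format_check`, and the `max_rounds` parameter are hypothetical names chosen for illustration, and the toy annotator merely stands in for a VLM call.

```python
from typing import Callable, List, Optional, Tuple

# A check returns None on success, or a short error message on failure.
Check = Callable[[str], Optional[str]]
# An annotator maps (sample, feedback from earlier failures) to a candidate annotation.
Annotator = Callable[[str, List[str]], str]


def annotate_with_feedback(
    sample: str,
    annotate: Annotator,
    checks: List[Check],
    max_rounds: int = 3,  # hypothetical retry budget
) -> Tuple[Optional[str], List[str]]:
    """Return (accepted annotation or None, all error feedback collected)."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = annotate(sample, feedback)
        errors = []
        for check in checks:
            message = check(candidate)
            if message is not None:
                errors.append(message)
        if not errors:
            return candidate, feedback
        # Failed candidates are regenerated with the error messages as context.
        feedback.extend(errors)
    return None, feedback


def toy_annotate(sample: str, feedback: List[str]) -> str:
    # Stand-in for a model call: adds the decision line only after being told it is missing.
    text = f"Scene: {sample}."
    if any("decision" in item for item in feedback):
        text += " Decision: slow down."
    return text


def format_check(text: str) -> Optional[str]:
    # Hypothetical format rule: every annotation must contain a decision line.
    return None if "Decision:" in text else "missing decision line"


annotation, log = annotate_with_feedback("pedestrian ahead", toy_annotate, [format_check])
print(annotation)
print(log)
```

In this sketch each filter is just a function that returns an error message, so rule-based and model-based checks could share one interface; samples that never pass within the retry budget are returned as `None` rather than being kept.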

Paper Structure (19 sections, 8 equations, 12 figures, 10 tables)

This paper contains 19 sections, 8 equations, 12 figures, 10 tables.

Introduction
Related Work
End-to-End Autonomous Driving
MLLM for Autonomous Driving
Reinforcement Fine-tuning
MindDriver
Progressive Multimodal Reasoning Framework
Feedback-Guided Data Auto-annotation
Progressive Reinforcement Fine-tuning
Experiments
Experiment settings
Main results
Ablation Study
Qualitative Visualization
Conclusion
...and 4 more sections

Figures (12)

Figure 1: Comparison of different reasoning methods. Text reasoning struggles with space misalignment, while image reasoning suffers from guideless image prediction. Our proposed progressive multimodal reasoning conducts aligned smooth reasoning.
Figure 2: Overview. (Left) Framework of MindDriver. MindDriver conducts the perception-imagination-action process for accurate trajectory planning. (Top right) Reasoning data annotation pipeline. The progressive multimodal reasoning data is auto-annotated using rule-based and model-based filtering together with feedback-guided regeneration. (Bottom right) Progressive reinforcement fine-tuning is applied to enhance the progressive reasoning process.
Figure 3: Auto-annotation pipeline for progressive multimodal reasoning training data. Qwen2.5-VL-72B first annotates raw CoT, which is then filtered based on format, decision, and logic. Failed cases are re-annotated using error feedback to improve generation quality.
Figure 4: Qualitative comparison of MindDriver with baselines. (Left) Three scenarios from the open-loop nuScenes benchmark. The red trajectory is the prediction and the green one is the GT. (Right) The performance variation with timestamps on closed-loop Bench2Drive.
Figure 5: Prompt given to Qwen3-235B-A22B-Instruct for logical verification.
...and 7 more figures
