MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving
Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song
TL;DR
MindDrive addresses an imbalance in end-to-end autonomous driving: existing planners tend to either generate high-quality trajectory candidates or evaluate them robustly, but not both. It couples a Future-aware Trajectory Generator (FaTG), driven by a World Action Model (WaM), with a VLM-oriented Evaluator (VLoE) to enable forward-looking, multi-objective decision making that aligns with human reasoning. FaTG uses the WaM for ego-conditioned scene rollouts, and VLoE combines LaST-Former and VLM-Critic for semantic, language-guided scoring across safety, comfort, and efficiency. On the NAVSIM benchmarks, MindDrive reports state-of-the-art performance and strong robustness under safety-critical and synthetic conditions, and ablations indicate that both components are essential for reliable, interpretable planning.
Abstract
End-to-end autonomous driving (E2E-AD) has emerged as a new paradigm in which trajectory planning plays a crucial role. Existing studies mainly follow two directions: generation-oriented methods, which focus on producing high-quality trajectories with simple decision mechanisms, and selection-oriented methods, which perform multi-dimensional evaluation to select the best trajectory yet lack sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across the safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.
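The "context simulation - candidate generation - multi-objective trade-off" loop can be illustrated with a minimal sketch. This is not MindDrive's implementation: the names `select_trajectory`, `rollout`, and `critics` are hypothetical, the learned world model is replaced by a toy function that returns a static obstacle, and the VLM-based evaluation is replaced by a fixed weighted sum of hand-written safety, comfort, and efficiency scores, purely to show the control flow.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]          # (x forward, y lateral), in metres
Trajectory = List[Point]
Scene = List[Point]                  # toy "future scene": obstacle positions


@dataclass
class Scored:
    trajectory: Trajectory
    scores: Dict[str, float]
    total: float


def select_trajectory(
    candidates: List[Trajectory],
    rollout: Callable[[Trajectory], Scene],
    critics: Dict[str, Callable[[Trajectory, Scene], float]],
    weights: Dict[str, float],
) -> Scored:
    """Roll out each candidate, score it per objective, return the best one."""
    scored = []
    for traj in candidates:
        future = rollout(traj)       # ego-conditioned "what-if" simulation
        scores = {name: critic(traj, future) for name, critic in critics.items()}
        # Assumption: a weighted sum stands in for the learned trade-off.
        total = sum(weights[name] * value for name, value in scores.items())
        scored.append(Scored(traj, scores, total))
    return max(scored, key=lambda s: s.total)


def toy_rollout(traj: Trajectory) -> Scene:
    # Hypothetical stand-in for a world model: one static obstacle ahead.
    return [(10.0, 0.0)]


def safety(traj: Trajectory, scene: Scene) -> float:
    # 0 at contact, saturating at 1 once clearance reaches 3 m.
    clearance = min(math.hypot(x - ox, y - oy) for x, y in traj for ox, oy in scene)
    return min(1.0, clearance / 3.0)


def comfort(traj: Trajectory, scene: Scene) -> float:
    # Penalise the largest lateral jump between consecutive waypoints.
    max_step = max(abs(b[1] - a[1]) for a, b in zip(traj, traj[1:]))
    return 1.0 / (1.0 + max_step)


def efficiency(traj: Trajectory, scene: Scene) -> float:
    # Forward progress relative to a 15 m target, capped at 1.
    return min(1.0, traj[-1][0] / 15.0)


straight = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0)]
swerve = [(0.0, 0.0), (5.0, 1.0), (10.0, 3.0), (15.0, 3.0)]
stop = [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0), (3.0, 0.0)]

best = select_trajectory(
    candidates=[straight, swerve, stop],
    rollout=toy_rollout,
    critics={"safety": safety, "comfort": comfort, "efficiency": efficiency},
    weights={"safety": 0.5, "comfort": 0.2, "efficiency": 0.3},
)
print(best.scores, round(best.total, 3))
```

In this toy setup the straight candidate passes through the obstacle and gets a safety score of zero, while the stopping candidate is safe but makes little progress, so the selection step has to trade objectives off rather than optimize any single one. The weights are arbitrary illustrative values.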
