MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song

TL;DR

MindDrive addresses an imbalance in end-to-end autonomous driving between generating high-quality trajectory candidates and evaluating them robustly. It couples a Future-aware Trajectory Generator (FaTG), driven by a World Action Model (WAM) that performs ego-conditioned scene rollouts, with a VLM-oriented Evaluator (VLoE), whose LaST-Former and VLM-Critic provide semantic, language-guided scoring across safety, comfort, and efficiency. Together they enable forward-looking, multi-objective decision making that aligns with human reasoning. On the NAVSIM benchmarks, MindDrive reports state-of-the-art performance and strong robustness under safety-critical and synthetic conditions, and ablations indicate that both components are essential for reliable, interpretable planning in autonomous driving.

Abstract

End-to-end autonomous driving (E2E-AD) has emerged as a new paradigm in which trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory-generation-oriented methods, which focus on producing high-quality trajectories but use simple decision mechanisms, and trajectory-selection-oriented methods, which perform multi-dimensional evaluation to select the best trajectory yet lack sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WAM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

Paper Structure

This paper contains 27 sections, 15 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Comparison between previous paradigms and our MindDrive framework. (a) Trajectory generation-oriented methods [uniad, chen2024vadv2, sun2025sparsedrive, song2025don] invest heavily in producing diverse trajectories but rely on weak selectors, typically simple MLP or softmax heads, leading to suboptimal final decisions. (b) Trajectory selection-oriented methods [li2024hydra, wote, zheng2025simplevsf] provide multi-metric evaluation (safety, efficiency, comfort) yet depend on limited candidate generation, limiting overall planning performance. (c) Our proposed MindDrive integrates world-model–based trajectory generation with VLM-driven multi-objective reasoning, achieving high-quality generation and comprehensive selection in a harmonized design.
  • Figure 2: Overview of the MindDrive framework. MindDrive integrates world models and vision–language models, bringing together high-quality trajectory generation and comprehensive decision reasoning in a harmonized design. In the perception module, multi-view camera and LiDAR inputs are fused into BEV features, while ego states and initial action intents are extracted as an Ego Representation. The Future-aware Trajectory Generator (FaTG) embeds the Ego Representation into the BEV features to construct scene variants, and then performs “what-if” simulations over them using a World Action Model (WAM) to model and predict their future evolutions. Subsequently, the VLM-oriented Evaluator (VLoE) first uses the LaST-Former to process multimodal tokens from the prompt and from the FaTG, generating a reasoning token. It then applies a VLM-Critic to score each trajectory on safety, comfort, efficiency, and compliance. The final trajectory is selected based on the aggregated multi-objective score.
  • Figure 3: World Action Model (WAM). The module follows a spatial–temporal–spatial sandwich design, where the Transformer models spatial dependencies and Mamba captures temporal dynamics, enabling progressive spatial encoding, temporal rollout, and spatial reconstruction of future scene representations. During training, WAM is supervised by current and future BEV semantic map features generated from the simulator, which serve as ground-truth scene-variant representations for scene rollout learning.
  • Figure 4: VLM-oriented Evaluator (VLoE). The LaST-Former fuses the language token from the prompt with trajectory and scene tokens from the FaTG through a sentinel insertion mechanism, aligning their semantics and producing a unified reasoning token. The VLM-Critic extends a VLM with an additional score token that aggregates scoring-related features into critic hidden states, which the score head converts into multi-objective trajectory scores.
  • Figure 5: Qualitative comparison between TransFuser [TransFuser] and our method on challenging Navsafe scenarios. (a) Intersection: TransFuser exhibits trajectory deviations and unstable turning behavior, whereas our method produces smoother, lane-consistent plans. (b) Dense traffic: under heavy flow, TransFuser predictions become jittery and drift toward surrounding vehicles, while our method maintains stable, congestion-aware trajectories.
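
The Figure 2 caption states that the final trajectory is selected from an aggregated multi-objective score over safety, comfort, efficiency, and compliance. The sketch below illustrates one way such a selection step could look. It is not the authors' implementation: the captions do not say how the scores are aggregated, so the weighted sum, the weight values, the `CandidateScore` type, and the function names are all assumptions made for illustration, and the per-candidate scores are made-up numbers standing in for VLM-Critic outputs.

```python
from dataclasses import dataclass


@dataclass
class CandidateScore:
    """Per-trajectory scores on the four objectives named in Figure 2.

    Hypothetical container; in MindDrive these values would come from the
    VLM-Critic score head. Higher is assumed to be better.
    """
    safety: float
    comfort: float
    efficiency: float
    compliance: float


def aggregate(score: CandidateScore, weights: dict) -> float:
    """Combine the objectives into one scalar.

    Assumption: a weighted sum. The source only says the scores are
    "aggregated", so other rules (products, learned heads) are possible.
    """
    return sum(w * getattr(score, name) for name, w in weights.items())


def select_trajectory(scores: list, weights: dict) -> tuple:
    """Return (index of the best candidate, list of aggregated scores)."""
    if not scores:
        raise ValueError("at least one candidate trajectory is required")
    totals = [aggregate(s, weights) for s in scores]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Illustrative values only (not taken from the paper).
weights = {"safety": 0.4, "comfort": 0.2, "efficiency": 0.2, "compliance": 0.2}
candidates = [
    CandidateScore(safety=0.90, comfort=0.6, efficiency=0.7, compliance=1.0),
    CandidateScore(safety=0.50, comfort=0.9, efficiency=0.9, compliance=1.0),
    CandidateScore(safety=0.95, comfort=0.8, efficiency=0.6, compliance=0.9),
]
best_index, totals = select_trajectory(candidates, weights)
```

With these assumed weights, safety dominates the trade-off, so a comfortable and efficient but less safe candidate (the second one here) is not preferred. Changing the weights changes which candidate wins, which is why the aggregation rule matters as much as the individual scores.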