ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang

TL;DR

ImagiDrive is an end-to-end autonomous driving framework that integrates a driving agent based on a Vision-Language Model (VLM) with a scene imaginer based on a Driving World Model (DWM) to form a unified imagination-and-planning loop. It adds an early stopping mechanism and a trajectory selection strategy to address the efficiency and predictive accuracy challenges inherent in this integration.

Abstract

Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent's planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.

Paper Structure

This paper contains 13 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of autonomous driving paradigms. (a) VLM-based end-to-end methods may produce an effective planning strategy to avoid potential collisions. (b) DWMs predict and generate future scenarios (T+2 seconds) to identify potential hazards. (c) Our proposed framework, ImagiDrive, integrates both paradigms, using future scene imagination from the DWM-based scene imaginer to iteratively refine VLM-based policy decisions and enhance safety.
  • Figure 2: Overview of ImagiDrive. The system includes a driving agent, a scene imaginer, and a trajectory buffer. It operates in two modes: ImagiDrive-A is a standard planning model that uses only the driving agent, while ImagiDrive-S adopts an imagination-and-planning loop, where the scene imaginer generates future frames based on past observations and predicted trajectories. These imagined frames are iteratively fed back to refine planning. The trajectory buffer stores all trajectories, selects the best one, and decides when to terminate early.
  • Figure 3: Overview of the driving agent. The agent takes multi-modal inputs and produces both language and trajectory predictions.
  • Figure 4: Qualitative results in the closed-loop evaluation demonstrate that ImagiDrive effectively avoids collisions in an intersection side-encounter scenario.
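
The imagination-and-planning loop described in the abstract and in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the source does not state the early stopping criterion or the trajectory selection rule, so both are assumptions here (stop when two successive trajectories differ by less than a tolerance; select the buffered trajectory closest to all the others). The names `plan`, `imagine`, `max_rounds`, and `tol` are hypothetical.

```python
import math
from typing import Any, Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def max_waypoint_gap(a: Trajectory, b: Trajectory) -> float:
    """Largest Euclidean distance between corresponding waypoints."""
    return max(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b))


def select_trajectory(buffer: Sequence[Trajectory]) -> Trajectory:
    """ASSUMED selection rule: the trajectory with the smallest total gap to the rest."""
    return min(buffer, key=lambda c: sum(max_waypoint_gap(c, o) for o in buffer))


def imagine_and_plan(
    observations: Any,
    plan: Callable[[Any, Optional[Any]], Trajectory],   # stands in for the driving agent
    imagine: Callable[[Any, Trajectory], Any],          # stands in for the scene imaginer
    max_rounds: int = 4,
    tol: float = 0.1,
) -> Tuple[Trajectory, int]:
    """Run the loop and return (selected trajectory, number of buffered trajectories)."""
    # Initial plan from the observations alone, with no imagined frames yet.
    buffer: List[Trajectory] = [plan(observations, None)]
    for _ in range(max_rounds):
        # Imagine future frames conditioned on the latest trajectory.
        imagined = imagine(observations, buffer[-1])
        # Re-plan with the imagined frames as extra context.
        buffer.append(plan(observations, imagined))
        # ASSUMED early stopping rule: the plan has stopped changing.
        if max_waypoint_gap(buffer[-2], buffer[-1]) < tol:
            break
    return select_trajectory(buffer), len(buffer)
```

With stub callables in place of the two models, the loop returns after the refinement step at which successive plans agree within `tol`, or after `max_rounds` refinements otherwise. Skipping the loop and returning the initial plan corresponds to the agent-only mode (ImagiDrive-A) named in the Figure 2 caption.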