WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model

Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, Chen Lv

TL;DR

This work introduces WiseAD, a knowledge-augmented vision-language framework for end-to-end autonomous driving that jointly learns driving knowledge and trajectory planning. By leveraging diverse driving-knowledge datasets and rendering trajectory plans as textual prompts, WiseAD achieves state-of-the-art closed-loop performance on CARLA and strong knowledge-evaluation results on both in-domain and out-of-domain tasks. The key contributions include a joint learning recipe that mixes knowledge and planning data, an attention-prefix prompt to guide knowledge usage, and extensive experiments showing improvements in safety, route completion, and QA proficiency. The approach demonstrates the practical value of grounding autonomous driving decisions in explicit driving expertise and reasoning.

Abstract

The emergence of general human knowledge and impressive logical reasoning capacity in rapidly advancing vision-language models (VLMs) has driven increasing interest in applying VLMs to high-level autonomous driving tasks, such as scene understanding and decision-making. However, the relationship between knowledge proficiency, especially essential driving expertise, and closed-loop autonomous driving performance remains underexplored. In this paper, we investigate the effects of the depth and breadth of fundamental driving knowledge on closed-loop trajectory planning and introduce WiseAD, a specialized VLM tailored for end-to-end autonomous driving that is capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios. We employ joint training on driving knowledge and planning datasets, enabling the model to perform knowledge-aligned trajectory planning. Extensive experiments indicate that as the diversity of driving knowledge grows, critical accidents are notably reduced, contributing 11.9% and 12.4% improvements in driving score and route completion, respectively, on the CARLA closed-loop evaluations and achieving state-of-the-art performance. Moreover, WiseAD demonstrates remarkable performance in knowledge evaluations on both in-domain and out-of-domain datasets.

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An overview of the proposed WiseAD, a specialized vision-language model for end-to-end autonomous driving with extensive fundamental driving knowledge. Given a video clip, WiseAD is capable of answering various driving-related questions and performing knowledge-augmented trajectory planning according to the target waypoint.
  • Figure 2: The framework of WiseAD. Our model is built upon MobileVLM and takes video sequences and textual prompts as input. The outputs for all answers are unified into text to leverage the logical reasoning capability of vision-language models.
  • Figure 3: Qualitative comparison with InternVL-8B for driving knowledge evaluation.
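
The Figure 2 caption notes that model outputs, including planned trajectories, are expressed as text. The following is a minimal sketch of what such a round trip between waypoints and text could look like. The `(x, y)` string format, the function names `waypoints_to_text` and `text_to_waypoints`, and the two-decimal precision are illustrative assumptions, not the format used by WiseAD.

```python
import re
from typing import List, Tuple

Waypoint = Tuple[float, float]

# Hypothetical text format: "(x, y) (x, y) ..." with fixed decimal precision.
# This is an assumption for illustration, not the paper's actual encoding.
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def waypoints_to_text(waypoints: List[Waypoint], precision: int = 2) -> str:
    """Serialize a list of (x, y) waypoints into a plain-text string."""
    return " ".join(
        f"({x:.{precision}f}, {y:.{precision}f})" for x, y in waypoints
    )


def text_to_waypoints(text: str) -> List[Waypoint]:
    """Parse every "(x, y)" pair found in a generated string back to floats."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]


if __name__ == "__main__":
    plan = [(0.0, 1.5), (-0.25, 3.0)]
    encoded = waypoints_to_text(plan)
    print(encoded)
    print(text_to_waypoints(encoded))
```

Treating the trajectory as a string lets a single language-model head produce both question answers and plans, at the cost of needing a tolerant parser on the output side; the regular expression above simply skips any surrounding words the model might emit.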