VLP: Vision Language Planning for Autonomous Driving

Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, Liu Ren

TL;DR

VLP is a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. It shows improved performance in challenging long-tail scenarios and strong generalization when faced with new urban environments.

Abstract

Autonomous driving is a complex and challenging task that aims at safe motion planning through scene understanding and reasoning. While vision-only autonomous driving methods have recently achieved notable performance through enhanced scene understanding, several key issues, including lack of reasoning, low generalization performance and long-tail scenarios, still need to be addressed. In this paper, we present VLP, a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. VLP enhances autonomous driving systems by strengthening both the source memory foundation and the self-driving car's contextual understanding. VLP achieves state-of-the-art end-to-end planning performance on the challenging NuScenes dataset, reducing average L2 error and collision rate by 35.9% and 60.5%, respectively, compared to the previous best method. Moreover, VLP shows improved performance in challenging long-tail scenarios and strong generalization capabilities when faced with new urban environments.

Paper Structure

This paper contains 22 sections, 8 equations, 9 figures, and 17 tables.

Figures (9)

  • Figure 1: The new-city generalization ability of ADS for planning is evaluated by training on Boston and testing on Singapore, and vice versa. Our proposed VLP shows strong generalization ability, significantly outperforming the state-of-the-art vision-only method, VAD, in terms of both L2 error and collision rate.
  • Figure 2: a) Overview of the proposed vision language planning (VLP) framework. VLP enhances ADS in two aspects, self-driving BEV reasoning and self-driving decision-making, through two modules, ALP and SLP, respectively. Leveraging an LM and contrastive learning, ALP conducts agent-wise learning to refine local details on the BEV, while SLP conducts sample-wise learning to improve the global context understanding of the ADS. VLP is active only during training, so no additional parameters or computations are introduced during inference. b) Prompt formats used in VLP.
  • Figure 3: Qualitative comparison between UniAD and ours. Green arrows highlight areas where our VLP outperforms the baseline. The results indicate that our VLP enables the self-driving car to navigate more efficiently and safely.
  • Figure 4: Qualitative comparison between UniAD and ours. Green arrows highlight areas where our VLP outperforms the baseline.
  • Figure 5: Qualitative comparison between UniAD and ours. Green arrows highlight areas where our VLP outperforms the baseline.
  • ...and 4 more figures
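
The Figure 2 caption says that ALP and SLP use a language model together with contrastive learning, but this section does not give the loss. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE-style contrastive loss between paired visual and text feature vectors. The function names, the temperature value, and the symmetric form are assumptions made for illustration, not the paper's actual formulation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical stand-in, not the paper's loss).

    visual_feats[i] and text_feats[i] are treated as a matching pair;
    every other combination in the batch is treated as a negative.
    """
    assert visual_feats and len(visual_feats) == len(text_feats)
    v = [l2_normalize(x) for x in visual_feats]
    t = [l2_normalize(x) for x in text_feats]
    n = len(v)
    # Cosine-similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-text direction: each row should select its own column.
    v2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-visual direction: each column should select its own row.
    t2v = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy features: correctly paired vs. deliberately mismatched.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

On these toy inputs the correctly paired features give a lower loss than the mismatched ones, which is the behaviour a contrastive objective is meant to reward. Because the caption notes that VLP is active only during training, a loss of this kind would add no cost at inference time.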