BEV-VLM: Trajectory Planning via Unified BEV Abstraction

Guancheng Chen, Sheng Yang, Tong Zhan, Jian Wang

TL;DR

BEV-VLM tackles trajectory planning for autonomous driving by using a Vision-Language Model that reasons over a unified BEV-HD Map, which fuses BEV features from multi-modal sensors with HD Maps. The approach converts high-dimensional sensor data into a compact BEV feature map and aligns it with HD Maps to form BEV-HD Map inputs for a VLM, enabling autoregressive waypoint generation over a horizon of $3\,\mathrm{s}$ in $0.5\,\mathrm{s}$ steps. It reports a $44.8\%$ reduction in displacement error compared to vision-only baselines and complete collision avoidance on nuScenes, with ablations showing the critical value of BEV features and HD Map alignment. The findings demonstrate that VLMs can effectively operate on processed BEV representations, broadening their applicability to multimodal, geometry-aware planning tasks.
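To make the planning horizon and the reported metric concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes the displacement error is the mean Euclidean (L2) distance between predicted and ground-truth waypoints, which is a common choice but is not confirmed by the summary above. The function names and the toy trajectories are hypothetical and chosen only for illustration.

```python
import math

HORIZON_S = 3.0  # planning horizon stated in the TL;DR
STEP_S = 0.5     # waypoint spacing stated in the TL;DR


def waypoint_times(horizon_s=HORIZON_S, step_s=STEP_S):
    """Timestamps (seconds) at which one waypoint each is predicted."""
    n = round(horizon_s / step_s)
    return [step_s * (i + 1) for i in range(n)]


def average_displacement_error(pred, gt):
    """Mean L2 distance between matched (x, y) waypoints (assumed metric)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline_err, new_err):
    """Fractional error reduction relative to a baseline, e.g. 0.448 for 44.8%."""
    return (baseline_err - new_err) / baseline_err


# Toy data only: a straight ground-truth path and a prediction with a
# constant (3 m, 4 m) offset, so every waypoint is exactly 5 m off.
times = waypoint_times()
gt_path = [(0.0, 2.0 * t) for t in times]
pred_path = [(x + 3.0, y + 4.0) for x, y in gt_path]
ade = average_displacement_error(pred_path, gt_path)
```

With these settings the planner emits six waypoints per trajectory, at 0.5 s through 3.0 s. A reported 44.8% reduction corresponds to `relative_reduction(baseline, new)` returning 0.448 for the respective baseline and BEV-VLM errors.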

Abstract

This paper introduces BEV-VLM, a novel framework for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird's-Eye View (BEV) feature maps as visual inputs. Unlike conventional approaches that rely solely on raw visual data such as camera images, our method utilizes highly compressed and informative BEV representations, which are generated by fusing multi-modal sensor data (e.g., camera and LiDAR) and aligning them with HD Maps. This unified BEV-HD Map format provides a geometrically consistent and rich scene description, enabling VLMs to perform accurate trajectory planning. Experimental results on the nuScenes dataset demonstrate a 44.8% improvement in planning accuracy and complete collision avoidance. Our work highlights that VLMs can effectively interpret processed visual representations like BEV features, expanding their applicability beyond raw images in trajectory planning.

Paper Structure

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overall Architecture of BEV-VLM. BEV-VLM comprises two core modules. First, it acquires BEV features via pre-trained models, then fuses these BEV features with HD Maps to produce detailed scene descriptions. Second, the framework feeds these generated scene descriptions into the VLM, leveraging the model's capabilities to execute trajectory prediction tasks.
  • Figure 2: Cross-Column Visualization of Trajectories. Several different scenarios are displayed, including turning left, going straight, turning right, and other driving actions. Green denotes the predicted trajectories, while orange denotes the ground-truth trajectories.
  • Figure 3: From left to right, this figure shows the visualized BEV Feature Map, the spatially aligned HD Map, and the BEV-HD Map with precise overlay rendering.