BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving

Katharina Winter, Mark Azer, Fabian B. Flohr

TL;DR

BEVDriver is an end-to-end, LLM-guided motion planner that fuses LiDAR and multi-view camera data into a latent BEV representation and uses it to ground natural language instructions in low-level waypoint prediction. A Q-Former aligns the latent BEV features with the natural language space, and a LoRA-adapted LLM directly predicts future waypoints, which a PID controller turns into driving commands. The model achieves state-of-the-art results on the LangAuto benchmark, with a Driving Score up to 18.9% higher than prior methods and improved Route Completion. The work provides ablations, demonstrations of open-loop and closed-loop performance, and releases datasets and code to support reproducibility and further research in language-guided autonomous driving. It also highlights challenges in temporal alignment and instruction distance, and outlines paths to improved temporal reasoning and explainability in future iterations.

Abstract

Autonomous driving has the potential to set the stage for more efficient future mobility, requiring the research domain to establish trust through safe, reliable and transparent driving. Large Language Models (LLMs) possess reasoning capabilities and natural language understanding, presenting the potential to serve as generalized decision-makers for ego-motion planning that can interact with humans and navigate environments designed for human drivers. While this research avenue is promising, current autonomous driving approaches are challenged by combining 3D spatial grounding and the reasoning and language capabilities of LLMs. We introduce BEVDriver, an LLM-based model for end-to-end closed-loop driving in CARLA that utilizes latent BEV features as perception input. BEVDriver includes a BEV encoder to efficiently process multi-view images and 3D LiDAR point clouds. Within a common latent space, the BEV features are propagated through a Q-Former to align with natural language instructions and passed to the LLM that predicts and plans precise future trajectories while considering navigation instructions and critical scenarios. On the LangAuto benchmark, our model reaches up to 18.9% higher performance on the Driving Score compared to SoTA methods.

Paper Structure

This paper contains 26 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We propose BEVDriver, an LLM-based end-to-end motion planner for closed-loop driving in CARLA. LiDAR and multi-view camera data are fused into BEV features, which BEVDriver uses to navigate the world while following natural language navigation instructions, combining high-level planning and low-level waypoint prediction.
  • Figure 2: Architecture of BEVDriver. Multi-view RGB images and 3D LiDAR point clouds are encoded into a BEV feature map, trained with object detection, semantic segmentation, traffic light detection and a self-supervised alignment loss. A Q-Former aligns the pre-trained latent features with the natural language space of the navigation instructions. A LoRA adapter feeds historical inputs to the LLM, which processes tokenized navigation instructions alongside perception data. The LLM outputs future waypoints, which a PID controller converts into driving commands, as well as scene descriptions and a boolean indicating instruction completion.
  • Figure 3: Two qualitative samples of the model driving in the CARLA simulator, shown in third-person and top-down views, while following the navigation instruction.
  • Figure 4: Semantic segmentation and BEV object detection for two scenes. Each scene shows the front-view RGB image, ground truth semantic segmentation, semantic prediction, ground truth detection and detection prediction.
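
The last stage named in the Figure 2 caption, turning predicted waypoints into driving commands with a PID controller, can be illustrated with a minimal sketch. This is not the authors' implementation. The gains, the 0.5 s waypoint spacing, the braking threshold, the throttle cap, and the ego-frame convention (x forward, y to the right, in metres) are all assumptions chosen for illustration, and `PID` and `waypoints_to_control` are hypothetical names.

```python
import math
from collections import deque


class PID:
    """Discrete PID controller (hypothetical helper, illustrative gains)."""

    def __init__(self, kp, ki, kd, window=20):
        self.kp, self.ki, self.kd = kp, ki, kd
        # Only the most recent errors are kept; their mean stands in
        # for the integral term.
        self.errors = deque(maxlen=window)

    def step(self, error):
        self.errors.append(error)
        integral = sum(self.errors) / len(self.errors)
        if len(self.errors) > 1:
            derivative = self.errors[-1] - self.errors[-2]
        else:
            derivative = 0.0
        return self.kp * error + self.ki * integral + self.kd * derivative


def clamp(value, low, high):
    return max(low, min(high, value))


def waypoints_to_control(waypoints, speed, turn_pid, speed_pid,
                         dt=0.5, brake_speed=0.4, max_throttle=0.75):
    """Map ego-frame waypoints to (steer, throttle, brake).

    waypoints: [(x_forward, y_right), ...] in metres (assumed convention).
    speed: current ego speed in m/s.
    dt: assumed time spacing between consecutive waypoints, in seconds.
    """
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]

    # Target speed: distance between the first two waypoints over their spacing.
    desired_speed = math.hypot(x1 - x0, y1 - y0) / dt

    # Steering: angle from the ego heading to the midpoint of the first two
    # waypoints; a positive angle means the aim point lies to the right.
    aim_x, aim_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    heading_error = math.atan2(aim_y, aim_x)
    steer = clamp(turn_pid.step(heading_error), -1.0, 1.0)

    # Brake when the waypoints are nearly stationary; otherwise track speed.
    brake = desired_speed < brake_speed
    throttle = clamp(speed_pid.step(desired_speed - speed), 0.0, max_throttle)
    if brake:
        throttle = 0.0
    return steer, throttle, brake


# Usage: waypoints straight ahead, 1 m apart, vehicle at rest.
turn_pid = PID(kp=1.25, ki=0.75, kd=0.3)
speed_pid = PID(kp=5.0, ki=0.5, kd=1.0)
steer, throttle, brake = waypoints_to_control(
    [(1.0, 0.0), (2.0, 0.0)], speed=0.0,
    turn_pid=turn_pid, speed_pid=speed_pid)
```

Separating a lateral (steering) controller from a longitudinal (speed) controller keeps each loop simple to tune. In a closed-loop setting the function would be called once per control step with the latest waypoint prediction.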