BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving
Katharina Winter, Mark Azer, Fabian B. Flohr
TL;DR
BEVDriver is an end-to-end, LLM-guided motion planner that grounds natural language instructions in low-level waypoint prediction, using a latent BEV representation fused from LiDAR and multi-view cameras. A Q-Former aligns the latent BEV features with the natural language space, and a LoRA-adapted LLM directly predicts future waypoints, which are then executed with PID control. The model achieves state-of-the-art results on the LangAuto benchmark, with a Driving Score up to 18.9% higher than prior methods and substantial improvements in Route Completion. The work includes comprehensive ablations, open-loop and closed-loop evaluations, and released datasets and code to support reproducibility and further research in language-guided autonomous driving. It also identifies open challenges in temporal alignment and instruction distance, and outlines paths toward better temporal reasoning and explainability in future iterations.
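To illustrate the last stage of the pipeline summarized above, the following minimal sketch turns predicted waypoints into steering, throttle and brake commands with two PID controllers. It is not the authors' controller: the gains, the 0.5 s waypoint spacing, the ego-frame convention (x forward, y to the right), the braking rule, and the names `PID` and `waypoints_to_control` are all assumptions chosen for illustration.

```python
import math
from collections import deque


class PID:
    """Generic PID controller; the integral term is a mean over a sliding window."""

    def __init__(self, kp, ki, kd, window=20):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.errors = deque(maxlen=window)

    def step(self, error):
        self.errors.append(error)
        integral = sum(self.errors) / len(self.errors)
        derivative = self.errors[-1] - self.errors[-2] if len(self.errors) > 1 else 0.0
        return self.kp * error + self.ki * integral + self.kd * derivative


def waypoints_to_control(waypoints, speed, turn_pid, speed_pid, dt=0.5, max_throttle=0.75):
    """Map predicted waypoints to (steer, throttle, brake).

    waypoints: list of (x, y) in metres, ego frame, x forward, y right (assumed).
    speed:     current ego speed in m/s.
    dt:        assumed time between consecutive waypoints in seconds.
    """
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]

    # Desired speed: distance between the first two waypoints per time step.
    target_speed = math.hypot(x1 - x0, y1 - y0) / dt

    # Heading error: angle from the ego heading to the midpoint of the two waypoints.
    aim_x, aim_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    heading_error = math.atan2(aim_y, aim_x)

    steer = max(-1.0, min(1.0, turn_pid.step(heading_error)))
    throttle = max(0.0, min(max_throttle, speed_pid.step(target_speed - speed)))

    # Simple assumed braking rule: brake when clearly faster than the target speed.
    brake = 1.0 if speed > 1.1 * target_speed else 0.0
    if brake:
        throttle = 0.0
    return steer, throttle, brake


# Usage: straight-ahead waypoints while the vehicle is standing still.
control = waypoints_to_control(
    [(1.0, 0.0), (2.0, 0.0)],
    speed=0.0,
    turn_pid=PID(1.25, 0.75, 0.3),   # illustrative gains
    speed_pid=PID(5.0, 0.5, 1.0),    # illustrative gains
)
```

Deriving the target speed from waypoint spacing lets the planner encode both path and velocity in a single output, so no separate speed head is needed in this sketch.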
Abstract
Autonomous driving has the potential to set the stage for more efficient future mobility, requiring the research domain to establish trust through safe, reliable and transparent driving. Large Language Models (LLMs) possess reasoning capabilities and natural language understanding, presenting the potential to serve as generalized decision-makers for ego-motion planning that can interact with humans and navigate environments designed for human drivers. While this research avenue is promising, current autonomous driving approaches struggle to combine 3D spatial grounding with the reasoning and language capabilities of LLMs. We introduce BEVDriver, an LLM-based model for end-to-end closed-loop driving in CARLA that utilizes latent BEV features as perception input. BEVDriver includes a BEV encoder to efficiently process multi-view images and 3D LiDAR point clouds. Within a common latent space, the BEV features are propagated through a Q-Former to align with natural language instructions and passed to the LLM, which predicts and plans precise future trajectories while considering navigation instructions and critical scenarios. On the LangAuto benchmark, our model achieves a Driving Score up to 18.9% higher than state-of-the-art (SoTA) methods.
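The alignment step described above can be pictured as a small, fixed set of query vectors attending over the BEV feature grid, so that the LLM receives a constant number of tokens regardless of the BEV resolution. The sketch below shows only that core idea as single-head cross-attention in plain Python. It is a deliberate simplification and not the paper's module: a real Q-Former stacks transformer layers with self-attention and learned projections, and the sizes (`DIM`, `NUM_QUERIES`, a 4x4 grid) and random values here are purely hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, bev_tokens):
    """Each query attends over all BEV tokens and returns one vector per query.

    Simplification: no key/value/output projections, one head, one layer.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every BEV token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in bev_tokens]
        weights = softmax(scores)
        # Weighted sum of BEV tokens: a fixed-size summary of the scene.
        outputs.append([sum(w * tok[d] for w, tok in zip(weights, bev_tokens)) for d in range(dim)])
    return outputs


random.seed(0)
DIM = 8           # hypothetical feature width
NUM_QUERIES = 4   # hypothetical number of query tokens
GRID = 4          # hypothetical 4x4 BEV grid, flattened to 16 tokens

bev_tokens = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(GRID * GRID)]
queries = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# These vectors stand in for the tokens that would be passed on to the LLM.
llm_tokens = cross_attend(queries, bev_tokens)
```

The output length depends only on the number of queries, which is what makes this kind of module a convenient bridge between a dense perception grid and a token-based language model.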
