BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding

TL;DR

BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs, enables LLMs to reason more effectively in cross-view driving scenes, and significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.

Abstract

The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs tokens extracted independently from multi-view and multi-frame images, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that, by leveraging BEV features as unified inputs, BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.

Paper Structure

This paper contains 49 sections, 2 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Representation Comparison: Left: Vision encoders can leverage widely available semantically rich image-text data, but process multi-view images independently. Center: Bird's-Eye View (BEV) encoders provide a spatially consistent scene representation, but are limited to geometrically annotated data. Right (ours): We propose semantic distillation from LLMs to BEV encoders to build a semantically enhanced and spatially consistent scene representation.
  • Figure 2: Representation Study. We compare (1) I$_{\text{ViT}}$, visual tokens extracted from the Vision Transformer of the original VLM; (2) I$_{\text{UniAD}}$, visual tokens from the backbone before the BEV fusion; and (3) B$_{\text{UniAD}}$, BEV tokens produced by the same backbone after the BEV fusion. The input language question is the same in all three cases and is omitted from the figure for simplicity.
  • Figure 3: BEV Semantic Distillation: We distill the knowledge from the language model to the BEV representations by using Visual Question Answering (VQA) tasks while regularizing BEV spatial structure using the original object detection tasks.
  • Figure 4: Qualitative NeuroNCAP Results. Two representative closed-loop planning scenarios are presented for comparison between the baseline and semantically distilled models. The distilled model demonstrates improved decision-making under safety-critical scenarios, successfully performing a safe right turn in corner case 1 and an evasive lane change to the free right lane in corner case 2 to avoid potential collisions, where the baseline model fails.
  • Figure A.1: DriveLM Dataset Object Frequency. We visualize the object frequency in the Perception object-existence related questions, which are used to compute accuracy in \ref{tab:mlp_projector} and \ref{tab:side_by_side_vqa} in the main paper.
  • ...and 3 more figures
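
The caption of Figure 3 describes a training signal with two parts: a VQA loss that distills knowledge from the language model into the BEV representation, and the original object detection loss that regularizes the BEV spatial structure. A minimal sketch of how such a combined objective could be formed is given below. The paper's actual equations are not reproduced on this page, so the additive form, the weight `det_weight`, and the function names are assumptions made for illustration only.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_sum_exp - logits[target_index]


def distillation_objective(answer_logits, answer_targets, detection_loss,
                           det_weight=1.0):
    """Hypothetical joint objective: mean VQA token loss plus a weighted
    detection loss acting as a spatial regularizer.

    answer_logits  -- one list of vocabulary logits per answer token
    answer_targets -- the ground-truth token index for each answer token
    detection_loss -- scalar loss from the original detection head
    det_weight     -- assumed trade-off weight (not taken from the paper)
    """
    token_losses = [
        cross_entropy(logits, target)
        for logits, target in zip(answer_logits, answer_targets)
    ]
    vqa_loss = sum(token_losses) / len(token_losses)
    return vqa_loss + det_weight * detection_loss


# Toy usage: two answer tokens over a three-word vocabulary.
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]
total = distillation_objective(toy_logits, toy_targets, detection_loss=0.8,
                               det_weight=0.5)
```

In this sketch, gradients of both terms would flow into the shared BEV encoder, so the detection term keeps the representation geometrically grounded while the VQA term injects semantics. How the two terms are actually balanced is not stated in the text shown here.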