Vision Language Models in Autonomous Driving: A Survey and Outlook

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, Alois C. Knoll

TL;DR

This survey reviews how Vision-Language Models (VLMs) and Large Language Models (LLMs) are being integrated into autonomous driving (AD). It organizes the field along five dimensions (perception and understanding, navigation and planning, decision-making and control, end-to-end AD, and data generation) and introduces a taxonomy of VLM types (multimodal-to-text, M2T; multimodal-to-vision, M2V; vision-to-text, V2T) together with two inter-modality connection approaches: vision-text fusion and vision-text matching. It covers five core VLM tasks in AD: object referring and referred object tracking (OR/OR-T), open-vocabulary perception, traffic scene understanding, language-guided navigation, and conditional data generation. For each task it catalogs representative methods and datasets and summarizes the commonly used metrics. The paper also inventories autonomous-driving and language-enhanced datasets, analyzes contemporary approaches, and discusses practical deployment considerations, including foundation models, multi-modality adapters, and cooperative driving systems. Finally, it identifies major challenges (computation latency, temporal scene understanding, ethics, and privacy) and outlines future directions toward safe, interpretable, and scalable VLM-enabled autonomous driving.

Abstract

The applications of Vision-Language Models (VLMs) in the field of Autonomous Driving (AD) have attracted widespread attention due to their outstanding performance and the ability to leverage Large Language Models (LLMs). By incorporating language data, driving systems can gain a better understanding of real-world environments, thereby enhancing driving safety and efficiency. In this work, we present a comprehensive and systematic survey of the advances in vision language models in this domain, encompassing perception and understanding, navigation and planning, decision-making and control, end-to-end autonomous driving, and data generation. We introduce the mainstream VLM tasks in AD and the commonly utilized metrics. Additionally, we review current studies and applications in various areas and summarize the existing language-enhanced autonomous driving datasets thoroughly. Lastly, we discuss the benefits and challenges of VLMs in AD and provide researchers with the current research gaps and future trends.

Paper Structure (23 sections, 5 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Vision-Language Models and Large Language Models offer advancements in traditional tasks and pave the way for innovative applications in AD.
  • Figure 2: An overview of the taxonomy of Vision-Language Model tasks and applications in autonomous driving. This paper encompasses five major aspects of autonomous driving, i.e., perception and understanding, navigation and planning, decision-making and control, end-to-end autonomous driving, and data generation. We illustrate the main tasks and techniques inside the dashed rectangular boxes for each of these five dimensions.
  • Figure 3: Two inter-modality connection approaches of Vision-Language Models in Autonomous Driving: (a) Vision-text matching. We demonstrate the semantic similarity matching in the top-right of this figure. (b) Vision-text fusion. The fused features can be used for downstream tasks. The example image is from the KITTI dataset [geiger2013vision].
  • Figure 4: Overview of mainstream Vision-Language Models in Autonomous Driving. (a) Multimodal-to-text models take text and image or video as input and generate text, as in DriveGPT4 [xu2023drivegpt4]. (b) Multimodal-to-text models take text and point clouds as input and generate text, as in LiDAR-LLM. (c) Vision-to-text models accept video or image as input and produce text as output, e.g. ADAPT [jin2023adapt]. (d) Multimodal-to-vision models take image and text as input and output image or video, as in GAIA-1 [Hu2023GAIA-1].
  • Figure 5: Example comparison between the classic multiple object tracking task (left) and the referred object tracking task (right). The image sequences are extracted from the rear camera of the nuScenes dataset [caesar2020nuscenes].
  • ...and 4 more figures