A Survey on Multimodal Large Language Models for Autonomous Driving

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, Tianren Gao, Erlong Li, Kun Tang, Zhipeng Cao, Tong Zhou, Ao Liu, Xinrui Yan, Shuqi Mei, Jianguo Cao, Ziran Wang, Chao Zheng

TL;DR

The paper surveys the integration of Multimodal Large Language Models (MLLMs) into autonomous driving, mapping historical advances in autonomous systems to recent cross-modal language-vision capabilities. It surveys model development, perception-planning-control pipelines, and industry applications, supported by datasets and benchmarks; it also reports on the LLVM-AD workshop and its dataset initiatives (MAPLM, UCU). The authors highlight opportunities and challenges, including data scale, HD-map understanding, real-time hardware considerations, and safety/explainability, and they propose directions for future research and collaboration between academia and industry. Overall, the work articulates a roadmap for leveraging LLM-driven reasoning and multimodal perception to enhance safety, transparency, and user-centric control in next-generation autonomous driving systems.

Abstract

With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems built on large models have the potential to perceive the real world, make decisions, and control tools as humans do. In recent months, LLMs have attracted widespread attention in autonomous driving and map systems. Despite their immense potential, there is still no comprehensive understanding of the key challenges, opportunities, and future directions for applying LLMs in driving systems. In this paper, we present a systematic investigation of this field. We first introduce the background of Multimodal Large Language Models (MLLMs), the development of multimodal models using LLMs, and the history of autonomous driving. Then, we give an overview of existing MLLM tools for driving, transportation, and map systems, together with existing datasets and benchmarks. Moreover, we summarize the works in the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), which is the first workshop of its kind regarding LLMs in autonomous driving. To further promote the development of this field, we also discuss several important problems regarding the use of MLLMs in autonomous driving systems that need to be solved by both academia and industry.

Paper Structure (38 sections, 4 figures, 3 tables)

This paper contains 38 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: This survey paper focuses on the use of Multimodal Large Language Models (MLLMs) in the advancement of autonomous driving. The organization of the paper will delve into various aspects related to this topic.
  • Figure 2: Exploring the use of GPT-4V [gpt4v2023] to understand driving scenes and decide on driving actions. Our findings reveal that while GPT-4V adeptly identifies scene components such as objects, it falls short in recognizing critical traffic elements like lane information. This underscores the significant challenges yet to be overcome in advancing multimodal language models for reliable autonomous vehicle navigation.
  • Figure 3: The figure outlines the chronological development of autonomous driving technology. It begins with representative early explorations and advancements, such as the ALV Project by Carnegie Mellon University [kanade_autonomous_1986, robot_hall_of_fame], the Mitsubishi Debonair, the first to offer a LiDAR-based ADAS system [nabhan2020models], and Stanley by Stanford University, winner of the 2005 DARPA Grand Challenge [Stanley]. It then showcases recent achievements following the introduction of standardized levels of automation [sae_j3016_2014] and rapid progress in Deep Neural Networks. On the platform side, various open-source and commercial software solutions have been introduced, such as Tesla Autopilot [tesla_motors_manual], NVIDIA DRIVE, Autoware.AI [autoware_1, autoware_2], Baidu Apollo [apolloauto, apollo_2023], and PonyAlpha [ponyai]. On the regulatory and service side, autonomous driving technology is receiving increasing government acceptance and public acknowledgment: numerous companies have received permits to operate autonomous vehicles on public roads in designated regions, while more vehicles with autonomous driving capabilities are being mass-produced [group_mercedes-benz_2023]. Overall, it demonstrates the evolution and increasing sophistication of AD systems over several decades.
  • Figure 4: A timeline of recent advancements in Multimodal Large Language Models (MLLMs).