Prospective Role of Foundation Models in Advancing Autonomous Vehicles

Jianhua Wu, Bingzhao Gao, Jincheng Gao, Jianhao Yu, Hongqing Chu, Qiankun Yu, Xun Gong, Yi Chang, H. Eric Tseng, Hong Chen, Jie Chen

TL;DR

The paper surveys how Foundation Models (FMs), trained with self-supervised pretraining and fine-tuning, can address core autonomous vehicle (AV) challenges, particularly generalization under long-tail distributions and safety. It covers three FM domains relevant to AVs: language-and-vision models for scene understanding and reasoning, World Models for predictive environment modeling, and data-augmentation techniques to expand and diversify driving data. Through organized sections on end-to-end driving, FM-enabled perception and planning, generative and non-generative World Models, and data generation, the authors synthesize representative work and identify trends and gaps. While FMs hold promise for advancing AV performance, the work also highlights critical challenges (hallucination, latency, deployment in real-time systems, and alignment with human values) that motivate future research toward robust, efficient, and trustworthy autonomous driving solutions.

Abstract

With the development of artificial intelligence and breakthroughs in deep learning, large-scale Foundation Models (FMs) such as GPT and Sora have achieved remarkable results in many fields, including natural language processing and computer vision. The application of FMs in autonomous driving holds considerable promise. For example, they can enhance scene understanding and reasoning. By pre-training on rich linguistic and visual data, FMs can understand and interpret the various elements of a driving scene and provide cognitive reasoning that yields linguistic and action instructions for driving decisions and planning. Furthermore, FMs can augment data based on their understanding of driving scenarios, providing feasible scenes for the rare occurrences in the long-tail distribution that are unlikely to be encountered during routine driving and data collection. This enhancement can in turn improve the accuracy and reliability of autonomous driving systems. Another testament to the potential of FMs lies in World Models, exemplified by the DREAMER series, which demonstrate the ability to comprehend physical laws and dynamics. Learning from massive data under the self-supervised learning paradigm, World Models can generate unseen yet plausible driving environments, which helps improve the prediction of road users' behaviors and the off-line training of driving strategies. In this paper, we synthesize the applications and future trends of FMs in autonomous driving. By utilizing the powerful capabilities of FMs, we strive to tackle the potential issues stemming from the long-tail distribution in autonomous driving, consequently advancing overall safety in this domain.

Paper Structure

This paper contains 22 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Scaling Laws [10]
  • Figure 2: Emergent abilities of LLMs [9]
  • Figure 3: Pipeline of a supervised end-to-end autonomous driving system with a Pretraining Backbone. Multi-modal sensor data is fed to the Pretraining Backbone to extract features, which then pass into an autonomous driving framework, built by various methods, that carries out tasks such as planning and control.
  • Figure 4: Pipeline for enhancing autonomous driving with FMs, where FMs refer to language models and vision models. FMs take in perceptual information and use their strong scene-understanding and reasoning abilities to produce language-guided instructions and driving actions.
  • Figure 5: A typical pipeline for applying LLMs to decision-making in autonomous driving systems, referenced from DriveMLM [wang2023drivemlm].
  • ...and 3 more figures