Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Xu Yan, Haiming Zhang, Yingjie Cai, Jingming Guo, Weichao Qiu, Bin Gao, Kaiqiang Zhou, Yue Zhao, Huan Jin, Jiantao Gao, Zhen Li, Lihui Jiang, Wei Zhang, Hongbo Zhang, Dengxin Dai, Bingbing Liu

TL;DR

Addresses the need for vision foundation models in autonomous driving and the challenges of data scarcity, sensor fusion, and task heterogeneity. Proposes a unifying VFM development pipeline and surveys techniques in data preparation, self-supervised training, and downstream adaptation. Highlights progress in NeRF, diffusion models, 3D Gaussian Splatting, and world models as data-augmentation and world-modeling strategies, with a roadmap for future work. Provides an open-source Forge_VFM4AD repository to accelerate research and reproducibility.

Abstract

The rise of large foundation models, trained on extensive datasets, is revolutionizing the field of AI. Models such as SAM, DALL-E2, and GPT-4 showcase their adaptability by extracting intricate patterns and performing effectively across diverse tasks, thereby serving as potent building blocks for a wide range of AI applications. Autonomous driving, a vibrant front in AI applications, remains challenged by the lack of dedicated vision foundation models (VFMs). The scarcity of comprehensive training data, the need for multi-sensor integration, and the diverse task-specific architectures pose significant obstacles to the development of VFMs in this field. This paper delves into the critical challenge of forging VFMs tailored specifically for autonomous driving, while also outlining future directions. Through a systematic analysis of over 250 papers, we dissect essential techniques for VFM development, including data preparation, pre-training strategies, and downstream task adaptation. Moreover, we explore key advancements such as NeRF, diffusion models, 3D Gaussian Splatting, and world models, presenting a comprehensive roadmap for future research. To empower researchers, we have built and maintained https://github.com/zhanghm1995/Forge_VFM4AD, an open-access repository constantly updated with the latest advancements in forging VFMs for autonomous driving.

Paper Structure (28 sections, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Our survey at a glance. Background. This section first introduces the development of foundation models, while also delving into the diverse representations and applications within the autonomous driving community. Data Preparation. The challenge of amassing substantial volumes of data for training foundation models is particularly pronounced for vision foundation models in autonomous driving. Our investigation encompasses an in-depth analysis of existing autonomous driving datasets acquired through on-road collection and simulators, as well as noteworthy advancements such as generative adversarial networks (GAN), diffusion models, neural radiance field (NeRF), and 3D Gaussian Splatting (3DGS) techniques. Pre-training. Self-supervised learning constitutes a pivotal aspect of our exploration. We categorize prevalent self-supervised pre-training methods into reconstruction-based, contrastive-based, distillation-based, rendering-based, and world model-based approaches. Adaptation. In bridging the gap between trained Vision Foundation Models (VFMs) and downstream tasks, we investigate the application of VFMs developed in other domains to the autonomous driving field. We acknowledge the use of images from online resources and published papers.
  • Figure 2: Research tree of forging vision foundation models for autonomous driving.
  • Figure 3: Chronological overview of the image, LiDAR, BEV and occupancy representations. Only representative approaches are demonstrated.
  • Figure 4: Illustration of diffusion-based data generation. A noise image is denoised into a photo-realistic image, conditioned on the locations of object bounding boxes and a geometry-to-text (G2T) description. Images courtesy of chen2023geodiffusion.
  • Figure 5: Illustration of NeRF-based data generation methods. The 3D scene is modeled as a static background (grey) and a set of dynamic actors (red). Volume rendering generates neural feature descriptors, and a convolutional network then decodes the feature patches into an image. Images courtesy of yang2023unisim.
  • ...and 9 more figures