Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

Sheng Luo; Wei Chen; Wanxin Tian; Rui Liu; Luanxuan Hou; Xiubao Zhang; Haifeng Shen; Ruiqi Wu; Shuyi Geng; Yi Zhou; Ling Shao; Yi Yang; Bojun Gao; Qun Li; Guobin Wu

TL;DR

The paper addresses the need for robust road-scene understanding via multi-modal and multi-task foundation models. It provides a comprehensive taxonomy and roadmap of task-specific, unified multi-task, unified multi-modal, and prompting-based approaches, along with prerequisites and datasets. Key contributions include up-to-date synthesis through May 2024, coverage of datasets and evaluation metrics, and a detailed discussion of open challenges such as open-world generalization, efficient transfer, continual learning, embodied interaction, and world models. The survey offers a consolidated roadmap for researchers to advance open-world, data-efficient, and interactive driving systems, supported by a continuously updated repository.

Abstract

Foundation models have made a profound impact on various fields, emerging as pivotal components that significantly shape the capabilities of intelligent systems. In the context of intelligent vehicles, leveraging the power of foundation models has proven to be transformative, offering notable advancements in visual understanding. Equipped with multi-modal and multi-task learning capabilities, multi-modal multi-task visual understanding foundation models (MM-VUFMs) effectively process and fuse data from diverse modalities and simultaneously handle various driving-related tasks with powerful adaptability, contributing to a more holistic understanding of the surrounding scene. In this survey, we present a systematic analysis of MM-VUFMs specifically designed for road scenes. Our objective is not only to provide a comprehensive overview of common practices, namely task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques, but also to highlight their advanced capabilities in diverse learning paradigms. These paradigms include open-world understanding, efficient transfer for road scenes, continual learning, and interactive and generative capabilities. Moreover, we provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models. To help researchers stay abreast of the latest developments in MM-VUFMs for road scenes, we have established a continuously updated repository at https://github.com/rolsheng/MM-VUFM4DS

Paper Structure (30 sections, 5 equations, 8 figures, 1 table)

Introduction
Prerequisites and Roadmap
Prerequisites
Basic Architectures
Multi-modal Data
Multi-modal Learning
Multi-task Learning
Pretraining Objectives
Fine-tuning Techniques
Roadmap of Visual Understanding Foundation Models for Road Scenes
Common Practices on Visual Understanding Models for Road Scenes
Unified Multi-task Models
Unified Multi-modal Models
Prompting Foundation Models
Advanced Visual Understanding Foundation Models for Road Scenes
...and 15 more sections

Figures (8)

Figure 1: Overview of our survey at a glance. A multi-modal and multi-task foundation model for road scene understanding is defined as a framework that takes multi-modal data as input and outputs multi-task results. In the prerequisites section, we introduce the background knowledge needed before reading the main content. In the common practices section, we then cover up-to-date task-specific models, unified multi-task models, unified multi-modal models for road scene understanding, and prompting foundation models, respectively. The advanced models section shows their strengths in diverse learning paradigms, such as open-world understanding, efficient transfer for road scenes, continual learning, and interactive and generative capabilities. Finally, we list key challenges and promising future trends to address them.
Figure 2: Common multi-modal data used in road scenes. We divide them into two groups, i.e., vision-centric multi-modal data and vision-beyond multi-modal data. Solid arrows denote strong connections between two modalities and dashed arrows denote weak connections. Vision-centric multi-modal data are collected from perception sensors and usually contain detailed visual features, while vision-beyond multi-modal data have emerged more recently and contain more semantic and comprehensive information describing the holistic scene.
Figure 3: Roadmap of recent foundation models in driving scenarios. We divide these foundation models into LLFM, VLFM, and LVFM based on the data modality they use. LVFMs are large vision foundation models that take only vision-centric data as input. Pretrained on large-scale datasets, these foundation models can act as robust feature extractors and greatly facilitate downstream tasks. In contrast, LLFMs and VLFMs usually incorporate LLMs and VLMs, respectively, leveraging their robust reasoning ability to perform various complicated tasks.
Figure 4: Unified multi-task models can be categorized by their outputs into two distinct types. The first type (left) includes models with task-specific outputs, characterized by a shared encoder and an individual task-specific head for each task. In this type, the shared encoder processes the input data to produce 2D feature maps, and each task's dedicated head generates its task-specific output. Conversely, the second type (right) refers to models with unified language outputs. These models consist of a shared encoder and a unified text decoder that generates text for all tasks. The shared encoder transforms the input data into 1D token sequences, yielding language-based representations for all tasks.
Figure 5: Comparison of LLM-based (left) and VLM-based (right) unified multi-modal models. The LLM-based model places an LLM at its center and transforms multi-modal data into textual tokens that the LLM can easily model as a sequence. The VLM-based model emphasizes cross-modal interaction involving fusion, alignment, and matching across multi-modal data.
...and 3 more figures
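
The first model type described in the Figure 4 caption (a shared encoder with one head per task) can be illustrated with a minimal sketch. This is not code from the survey or from any model it covers: the class name `SharedEncoderMultiHead`, the helper `make_linear`, the task names, and all dimensions are hypothetical, and random linear maps over flat vectors stand in for learned layers and 2D feature maps.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Callable[[Vector], Vector]:
    """Return a random linear map (a hypothetical stand-in for a learned layer)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x: Vector) -> Vector:
        # One dot product per output unit.
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class SharedEncoderMultiHead:
    """One shared encoder feeding several task-specific heads (illustrative only)."""

    def __init__(self, in_dim: int, feat_dim: int, head_dims: Dict[str, int], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.encoder = make_linear(in_dim, feat_dim, rng)
        # Each task gets its own head on top of the same shared features.
        self.heads = {task: make_linear(feat_dim, dim, rng) for task, dim in head_dims.items()}

    def forward(self, x: Vector) -> Dict[str, Vector]:
        # Encode the input once, apply a ReLU, then reuse the features for every task.
        features = [max(0.0, v) for v in self.encoder(x)]
        return {task: head(features) for task, head in self.heads.items()}


# Hypothetical task names and output sizes, chosen only for illustration.
model = SharedEncoderMultiHead(
    in_dim=8,
    feat_dim=4,
    head_dims={"detection": 6, "segmentation": 3, "depth": 1},
)
outputs = model.forward([0.1 * i for i in range(8)])
for task, out in outputs.items():
    print(task, len(out))
```

The point of the design is that the encoder runs once per input and its features are shared, so adding a task only adds a head. The second type in the caption would replace the per-task heads with a single text decoder, which this sketch does not attempt to model.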
