Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Xiao Wang; Guangyao Chen; Guangwu Qian; Pengcheng Gao; Xiao-Yong Wei; Yaowei Wang; Yonghong Tian; Wen Gao

TL;DR

The paper surveys large-scale multi-modal pre-trained models (MM-PTMs), detailing the motivations, data ecosystems, architectural choices, and learning objectives that enable cross-modal understanding and generation. It covers data sources, objective functions (including contrastive and masking-based signals), and architectural paradigms (single-stream vs cross-stream) while highlighting knowledge-enhanced approaches and downstream tasks spanning generative, discriminative, and prompt-based learning. Through a synthesis of model characteristics, experimental trends, and representative results, the work delineates current capabilities and practical constraints of MM-PTMs at scale. It concludes with forward-looking directions such as incorporating more modalities, incremental and knowledge-driven pre-training, and improved prompt-based adaptation to bridge pre-training and fine-tuning in real-world deployments.

Abstract

Driven by the urgent demand for generalized deep models, many large pre-trained models have been proposed, such as BERT, ViT, and GPT. Inspired by the success of these models in single domains (such as computer vision and natural language processing), multi-modal pre-trained big models have drawn increasing attention in recent years. In this work, we give a comprehensive survey of these models, and we hope this paper provides new insights and helps new researchers track the most cutting-edge works. Specifically, we first introduce the background of multi-modal pre-training by reviewing conventional deep learning and pre-training works in natural language processing, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-trained models (MM-PTMs), and discuss MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training. After that, we introduce the downstream tasks used to validate large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey. This paper has been published in the journal Machine Intelligence Research (MIR), https://link.springer.com/article/10.1007/s11633-022-1410-8, DOI: 10.1007/s11633-022-1410-8, vol. 20, no. 4, pp. 447-482, 2023.

Paper Structure (27 sections, 14 equations, 13 figures, 5 tables)

Introduction
Background
Conventional Deep Learning
Pre-training in Natural Language Processing
Pre-training in Computer Vision
Pre-training in Audio and Speech
Multi-Modal Pre-training
Task Definition and Key Challenges
Advantages of MM-PTMs
Pre-training Data
Pre-training Objectives
Pre-training Network Architecture
Self-attention and Transformer
Single- and Multi-stream
Modality Interactive Learning
...and 12 more sections

Figures (13)

Figure 1: The chronological milestones on multi-modal pre-trained big models from 2019 to the present (June 2022), including multi-modal datasets (as shown by the orange arrow) and representative models (as shown by the blue arrow). The purple font indicates that the dataset contains Chinese text (other datasets contain English text). The models highlighted in wine red are trained on more than two modalities.
Figure 2: The overall framework of this survey.
Figure 3: The detailed architecture of the Transformer network (Vaswani et al., 2017).
Figure 4: The relations between multi-modal data, model, and computing power.
Figure 5: Representative pre-training objectives used in MM-PTMs.
...and 8 more figures
