Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Muhammad Arslan Manzoor, Sarah Albarri, Ziting Xian, Zaiqiao Meng, Preslav Nakov, Shangsong Liang

TL;DR

This survey analyzes the evolution of multimodal representation learning from task-specific architectures to large-scale multimodal pretraining, emphasizing transformer backbones and unifying designs. It catalogs pretraining types, objectives, and architectures, and canvasses a wide range of applications across understanding, classification, generation, retrieval, and translation, supported by benchmark datasets. The work highlights major advances in vision–language and audio–visual modeling, discusses remaining challenges, and outlines future directions including multilingual data, instruction tuning, and efficiency. Overall, the paper provides a comprehensive framework and dataset map to guide researchers toward scalable, multitask, and multilingual multimodal systems.

Abstract

Multimodality Representation Learning, the technique of learning to embed information from different modalities together with their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures have performed extraordinarily well across multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) approaches ranging from state-of-the-art pretrained multimodal models to unifying architectures, and (iv) multimodal task categories and possible future improvements for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.

Paper Structure (42 sections, 3 figures, 4 tables)

This paper contains 42 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The pie chart on the left represents the percentage of papers for each section included in this survey. The pie chart in the center represents the percentage of papers for each multimodal application. The rightmost figure expresses the growth of deep learning-based multimodal papers in the last six years on Google Scholar.
  • Figure 2: (a) Task-specific multimodal methods are trained on benchmarks created for specialized downstream tasks. After preprocessing, each encoder ($\mathrm{E}_i$) receives the data of one modality ($\mathrm{M}_i$) and produces an embedding. The fusion module handles the interaction of features from different modalities and is trained to predict the downstream task. (b) In pretraining, raw data from the web or from a dataset, in any modality ($\mathrm{M}_i$), is preprocessed and passed to a specialized encoder ($\mathrm{E}_i$) at the encoding stage. The encoders produce embeddings, which the fusion module integrates into a meaningful unified representation by learning to predict pretext tasks. The blue-bounded box in the pretraining stage represents the foundation model, which is readily available at the fine-tuning stage as a pretrained model. A multimodal dataset for the downstream task can leverage this generic pretrained model for downstream prediction.
  • Figure 3: Taxonomy of the multimodal applications in Section 4.
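
The task-specific pipeline described in the Figure 2(a) caption (one encoder per modality, a fusion module, then a task prediction) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the survey: the function names, the feature dimensions, the fixed random linear encoders, fusion by concatenation, and the untrained linear head. A real system would use learned encoders such as transformers and a trained fusion module.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]


def make_linear_encoder(in_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """Return a toy encoder E_i: a fixed random linear map followed by ReLU.

    Hypothetical stand-in for a learned modality encoder; weights are untrained.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(features: Vector) -> Vector:
        if len(features) != in_dim:
            raise ValueError(f"expected {in_dim} features, got {len(features)}")
        return [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weights]

    return encode


def fuse_by_concatenation(embeddings: Dict[str, Vector], order: List[str]) -> Vector:
    """Toy fusion module: concatenate per-modality embeddings in a fixed order."""
    fused: Vector = []
    for name in order:
        fused.extend(embeddings[name])
    return fused


def predict(fused: Vector, head_weights: List[Vector]) -> int:
    """Toy task head: one linear score per class, return the argmax class index."""
    scores = [sum(w * x for w, x in zip(row, fused)) for row in head_weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Preprocessed inputs for two modalities M_i (made-up feature vectors).
inputs = {"text": [0.2, 0.7, 0.1], "image": [0.9, 0.3, 0.5, 0.4]}

# One encoder E_i per modality, each mapping to a 4-dimensional embedding.
encoders = {
    "text": make_linear_encoder(in_dim=3, out_dim=4, seed=0),
    "image": make_linear_encoder(in_dim=4, out_dim=4, seed=1),
}

order = ["text", "image"]
embeddings = {name: encoders[name](inputs[name]) for name in order}
fused = fuse_by_concatenation(embeddings, order)

# Untrained two-class head over the fused representation.
head_rng = random.Random(2)
head = [[head_rng.uniform(-1.0, 1.0) for _ in range(len(fused))] for _ in range(2)]
label = predict(fused, head)
```

Concatenation is only the simplest fusion choice. Under the pretraining setup of Figure 2(b), the same encoder and fusion stack would be optimized on pretext tasks first and then reused for the downstream prediction.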