Self-Supervised Multimodal Learning: A Survey

Yongshuo Zong; Oisin Mac Aodha; Timothy Hospedales

TL;DR

This survey of self-supervised multimodal learning (SSML) reviews how multimodal models can be scaled without manual labels by using self-supervised objectives based on instance discrimination, clustering, and masked prediction. It analyzes architectural choices for fusion, including fusion-free designs, early/unified encoders, and the stitching paradigm that reuses pretrained unimodal models, along with strategies for coarse- and fine-grained alignment when paired data is unavailable. The survey covers practical applications in healthcare, remote sensing, and state representation for control, and discusses theoretical gaps, open challenges, and directions for scalable, robust, and fair SSML systems. Overall, it presents SSML as a scalable route toward general-purpose multimodal intelligence, while highlighting the need for data-centric and theoretical advances to sustain progress.

Abstract

Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state-of-the-art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: (1) learning representations from multimodal data without labels, (2) fusion of different modalities, and (3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider (1) objectives for learning from multimodal unlabeled data via self-supervision, (2) model architectures from the perspective of different multimodal fusion strategies, and (3) pair-free learning strategies for coarse-grained and fine-grained alignment. We also review real-world applications of SSML algorithms in diverse fields such as healthcare, remote sensing, and machine translation. Finally, we discuss challenges and future directions for SSML. A collection of related resources can be found at: https://github.com/ys-zong/awesome-self-supervised-multimodal-learning.

Paper Structure (44 sections, 22 equations, 7 figures, 4 tables)

Introduction
Background
Notation
Learning Paradigms
Scope of the Survey
Self-supervision in Multimodal Learning
Multimodal vs. Multiview
Generative vs. Self-Supervised Models
Multimodal Learning without Labels
Instance Discrimination
Contrastive
Matching Prediction
Clustering
Masked Prediction
Auto-encoding Masked Prediction
...and 29 more sections

Figures (7)

Figure 1: Challenges and solutions for self-supervised multimodal learning.
Figure 2: Learning paradigms for (a) supervised multimodal learning, and (b) self-supervised multimodal learning; illustrating self-supervised pretraining without manual annotations (top) and supervised fine-tuning or linear readout for downstream tasks (bottom).
Figure 3: An illustrative schematic of instance discrimination objectives.
Figure 4: An illustrative schematic of masked prediction frameworks.
Figure 5: Illustration of different modality fusion architectures.
...and 2 more figures
