Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data

Shohreh Deldari; Hao Xue; Aaqib Saeed; Jiayuan He; Daniel V. Smith; Flora D. Salim

TL;DR

This survey addresses self-supervised representation learning for multimodal and temporal data, tackling annotation bottlenecks by synthesising methods that leverage data-derived supervisory signals. It introduces a unified SSRL pipeline and a four-part taxonomy (pretext, contrastive, clustering, regularisation), and reviews temporal and multimodal methods across sensors, audio, video, and text-augmented data. Key contributions include a comprehensive categorisation, cross-modal and temporal extensions, and a discussion of challenges such as contrastive pair construction, domain-agnostic representations, robustness to irregular data, and augmentation strategies. The work provides a practical roadmap for selecting models and identifying opportunities to advance SSRL in real-world, multimodal temporal settings.

Abstract

Self-Supervised Representation Learning (SSRL) has attracted much attention in the fields of computer vision (CV), speech, and natural language processing (NLP) and, more recently, for other modalities, including time series from sensors. The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training, and acquiring annotated data can be a difficult and costly process. Self-supervised methods have been introduced to improve the efficiency of training data through discriminative pre-training of models using supervisory signals that are freely obtained from the raw data. Unlike existing reviews of SSRL, which have predominantly focused on methods in the fields of CV or NLP for a single modality, we aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data. To this end, we 1) provide a comprehensive categorization of existing SSRL methods, 2) introduce a generic pipeline by defining the key components of an SSRL framework, 3) compare existing models in terms of their objective function, network architecture and potential applications, and 4) review existing multimodal techniques in each category and for various modalities. Finally, we present existing weaknesses and future opportunities. We believe our work develops a perspective on the requirements of SSRL in domains that utilise multimodal and/or temporal data.

Paper Structure (29 sections, 5 figures, 9 tables)

This paper contains 29 sections, 5 figures, 9 tables.

Introduction
Representation Learning
Self-Supervised Representation Learning
Self-Supervised Representation Learning of Multimodal and Temporal Data
Motivation and Contributions
Related Surveys
Self-Supervised Representation Learning: Definitions and Background
Definitions
Background and Frameworks
Self-Supervised Representation Learning for Temporal Data
Pretext Task
Contrastive Models
Clustering
Regularisation-Based Models
Self-Supervised Representation Learning in Multimodal Data
...and 14 more sections

Figures (5)

Figure 1: Electrocardiogram (ECG) signals in two different views: (1) sample ECG waveforms of (a) cardiac arrhythmia (ARR) and (c) normal sinus rhythm (NSR); and (2) scalograms of (b) ARR and (d) NSR, computed using the continuous wavelet transform.
Figure 2: Supervised vs self-supervised
Figure 3: Self-supervised representation learning (SSRL) workflow. First, SSRL methods take unlabeled data as inputs and extract new instances and their corresponding pseudo labels using various techniques, such as data transformation, temporal/spatial masking, innate relations and cross-modality matching. Next, representations are learned with the aim of predicting those extracted pseudo labels. Finally, the pre-trained encoder is transferred to a supervised/unsupervised downstream task with limited labeled data.
Figure 4: Categories of self-supervised representation learning frameworks applicable across all underlying architectures and modalities.
Figure 5: Comparison of the overall architectures of different self-supervised representation learning models.
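
The first step of the workflow in Figure 3 (extracting new instances and pseudo labels from unlabeled data) can be illustrated with a minimal sketch. It is not taken from the paper: the transformation set, the function names `make_transformation_task` and `make_masking_task`, and the toy sensor windows are all assumptions chosen for illustration. It shows two of the techniques named in the caption, data transformation (the pseudo label is the index of the applied transformation) and temporal masking (the target is the hidden span).

```python
import random

# Hypothetical transformation set (an assumption for illustration);
# the index of each entry serves as the pseudo label.
TRANSFORMS = [
    ("identity", lambda x: list(x)),
    ("negate", lambda x: [-v for v in x]),
    ("time_reverse", lambda x: list(reversed(x))),
    ("scale_x2", lambda x: [2.0 * v for v in x]),
]


def make_transformation_task(windows, rng):
    """Return (instance, pseudo_label) pairs for a transformation-recognition task."""
    pairs = []
    for window in windows:
        label = rng.randrange(len(TRANSFORMS))  # pick one transformation at random
        _, fn = TRANSFORMS[label]
        pairs.append((fn(window), label))
    return pairs


def make_masking_task(window, mask_len, rng, mask_value=0.0):
    """Hide a contiguous span; the hidden values are the reconstruction target."""
    start = rng.randrange(len(window) - mask_len + 1)
    masked = list(window)  # copy so the raw window is left untouched
    target = masked[start:start + mask_len]
    masked[start:start + mask_len] = [mask_value] * mask_len
    return masked, start, target


# Toy unlabeled windows standing in for raw sensor time series.
rng = random.Random(0)
unlabeled = [[0.1, 0.4, -0.2, 0.3], [1.0, 0.5, 0.0, -0.5]]
pairs = make_transformation_task(unlabeled, rng)
masked, start, target = make_masking_task(unlabeled[0], mask_len=2, rng=rng)
```

In a full pipeline, an encoder would then be trained to predict `label` from each transformed instance, or to reconstruct `target` from `masked`, before being transferred to a downstream task; that training step is omitted here.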
