Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data

Shohreh Deldari, Hao Xue, Aaqib Saeed, Jiayuan He, Daniel V. Smith, Flora D. Salim

TL;DR

This survey addresses self-supervised representation learning for multimodal and temporal data, tackling annotation bottlenecks by synthesising methods that leverage data-derived supervisory signals. It introduces a unified SSRL pipeline and a four-part taxonomy (pretext, contrastive, clustering, regularisation), and reviews temporal and multimodal methods across sensors, audio, video, and text-augmented data. Key contributions include a comprehensive categorisation, cross-modal and temporal extensions, and a discussion of challenges such as contrastive pair construction, domain-agnostic representations, robustness to irregular data, and augmentation strategies. The work provides a practical roadmap for selecting models and identifying opportunities to advance SSRL in real-world, multimodal temporal settings.

Abstract

Self-Supervised Representation Learning (SSRL) has recently attracted much attention in computer vision (CV), speech, and natural language processing (NLP), and more recently in other modalities, including time series from sensors. The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training. Acquiring annotated data can be a difficult and costly process. Self-supervised methods have been introduced to improve the efficiency of training data through discriminative pre-training of models using supervisory signals that are freely obtained from the raw data. Unlike existing reviews of SSRL, which have predominantly focused on methods in CV or NLP for a single modality, we aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data. To this end, we 1) provide a comprehensive categorization of existing SSRL methods, 2) introduce a generic pipeline by defining the key components of an SSRL framework, 3) compare existing models in terms of their objective function, network architecture and potential applications, and 4) review existing multimodal techniques in each category and for various modalities. Finally, we present existing weaknesses and future opportunities. We believe our work develops a perspective on the requirements of SSRL in domains that utilise multimodal and/or temporal data.

Paper Structure (29 sections, 5 figures, 9 tables)


Figures (5)

  • Figure 1: Electrocardiogram (ECG) signals in two different views: (1) sample ECG waveforms of (a) cardiac arrhythmia (ARR) and (c) normal sinus rhythm (NSR); and (2) scalograms of (b) ARR and (d) NSR, computed using the continuous wavelet transform.
  • Figure 2: Supervised vs self-supervised
  • Figure 3: Self-supervised representation learning (SSRL) workflow. First, SSRL methods take unlabeled data as input and extract new instances and their corresponding pseudo labels using various techniques, such as data transformation, temporal/spatial masking, innate relation and cross-modality matching. Next, representations are learned with the aim of predicting those extracted pseudo labels. Finally, the pre-trained encoder is transferred to a supervised/unsupervised downstream task with limited labeled data.
  • Figure 4: Categories of self-supervised representation learning frameworks, applicable across all underlying architectures and modalities.
  • Figure 5: Comparison of the overall architectures of different self-supervised representation learning models.
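
The first stage of the workflow described in the Figure 3 caption (deriving new instances and pseudo labels from unlabeled data via data transformation) can be illustrated with a minimal sketch. The specific transformations (identity, negate, reverse, jitter), the function names, and the toy sine windows are assumptions chosen for illustration; they are not taken from the survey.

```python
import math
import random


def make_transforms(rng, sigma=0.05):
    """Hypothetical set of time-series transformations (illustrative only)."""
    return [
        ("identity", lambda x: list(x)),
        ("negate", lambda x: [-v for v in x]),
        ("reverse", lambda x: list(reversed(x))),
        ("jitter", lambda x: [v + rng.gauss(0.0, sigma) for v in x]),
    ]


def make_pretext_dataset(windows, seed=0):
    """Turn unlabeled windows into (transformed_window, pseudo_label) pairs.

    The pseudo label is the index of the transformation that was applied,
    so no human annotation is needed.
    """
    rng = random.Random(seed)
    transforms = make_transforms(rng)
    samples = []
    for window in windows:
        for label, (_name, fn) in enumerate(transforms):
            samples.append((fn(window), label))
    rng.shuffle(samples)
    return samples


# Toy unlabeled data: two sine windows with different phases (assumed example).
windows = [[math.sin(0.3 * t + phase) for t in range(16)] for phase in (0.0, 1.0)]
dataset = make_pretext_dataset(windows)
print(len(dataset))  # 2 windows x 4 transformations = 8 samples
```

In a full pipeline, an encoder with a small classification head would be trained to predict the pseudo label from the transformed window, and the encoder alone would then be reused for the downstream task; that training step is omitted here.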