Multi-Modal Self-Supervised Semantic Communication

Hang Zhao, Hongru Li, Dongfang Xu, Shenghui Song, Khaled B. Letaief

TL;DR

This work tackles training overhead in multi-modal semantic communication by introducing a two-stage framework: (i) multi-modal self-supervised pre-training at edge devices to learn task-agnostic cross-modal and intra-modal representations, and (ii) task-specific supervised fine-tuning for end-to-end edge-server inference. The pre-training objective decomposes into cross-modal and intra-modal components, $L_{\text{pre-train}} = L_{\text{cross}} + L_{\text{intra}}$, and leverages correlation matrices to separate shared and unique information across the RGB and depth modalities. Empirical results on NYU Depth V2 show substantially reduced training-related communication overhead while matching or surpassing state-of-the-art supervised methods, with faster convergence and robustness under limited labels. The approach highlights the benefits of decoupled multi-modal self-supervision for scalable, edge-centric semantic communication in dynamic wireless environments.
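
To make the decomposition $L_{\text{pre-train}} = L_{\text{cross}} + L_{\text{intra}}$ concrete, the following is a minimal NumPy sketch of a correlation-matrix-based objective. The exact form of each term is not given in this summary, so the sketch substitutes a Barlow Twins-style redundancy-reduction penalty as a stand-in. All function names, the `off_diag_weight` value, and the use of two augmented views per modality for the intra-modal term are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def standardize(z, eps=1e-8):
    """Zero-mean, unit-variance per feature dimension across the batch."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

def cross_correlation(za, zb):
    """Batch-averaged cross-correlation matrix of shape (d, d)."""
    n = za.shape[0]
    return standardize(za).T @ standardize(zb) / n

def correlation_loss(c, off_diag_weight=5e-3):
    """Assumed penalty: pull the diagonal toward 1 and the off-diagonal toward 0."""
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + off_diag_weight * off_diag

def pretrain_loss(rgb, depth, rgb_v1, rgb_v2, depth_v1, depth_v2):
    """Return (L_pre-train, L_cross, L_intra) for one batch of embeddings.

    rgb, depth        : embeddings of the same scenes from the two modalities
    *_v1, *_v2        : embeddings of two augmented views within one modality
                        (hypothetical setup for the intra-modal term)
    """
    l_cross = correlation_loss(cross_correlation(rgb, depth))
    l_intra = (correlation_loss(cross_correlation(rgb_v1, rgb_v2))
               + correlation_loss(cross_correlation(depth_v1, depth_v2)))
    return l_cross + l_intra, l_cross, l_intra

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim = 64, 8
    emb = [rng.normal(size=(batch, dim)) for _ in range(6)]
    total, l_cross, l_intra = pretrain_loss(*emb)
    print(f"L_cross={l_cross:.3f}  L_intra={l_intra:.3f}  total={total:.3f}")
```

In this sketch each encoder output is a `(batch, dim)` array, and the loss is computed entirely on the device that holds the embeddings, which is consistent with pre-training taking place at the edge before any supervised fine-tuning with the server.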

Abstract

Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.

Paper Structure

This paper contains 11 sections, 16 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Framework of the multi-modal communication system.
  • Figure 2: Proposed multi-modal communication system that first conducts self-supervised encoder training, followed by end-to-end joint training. The system processes two modalities: RGB and depth data.
  • Figure 3: Mutual information analysis for single-modal and multi-modal pre-training methods.
  • Figure 4: The communication round versus test accuracy for the four methods with full labels.
  • Figure 5: The communication round versus test accuracy for the four methods with 50% labels.

Theorems & Definitions (1)

  • Remark 1