Large AI Model Empowered Multimodal Semantic Communications

Feibo Jiang, Li Dong, Yubo Peng, Kezhi Wang, Kun Yang, Cunhua Pan, Xiaohu You

TL;DR

This work tackles multimodal semantic communication over wireless channels by introducing LAM-MSC, a framework that unifies heterogeneous data via CoDi-based multimodal alignment (MMA) into a text modality, uses a GPT-4–driven personalized knowledge base (LKB) to resolve semantic ambiguities, and applies a CGAN-based channel estimation (CGE) to mitigate fading. It leverages large AI models to perform accurate semantic extraction, personalized semantic recovery, and robust channel inference, aiming for low latency and high semantic fidelity. Experiments on VOC2012, LibriSpeech, and UCF101 demonstrate that LAM-MSC achieves superior semantic transmission accuracy and data compression compared to unimodal or non-CGE baselines, with the cosine similarity threshold for correctness set at $0.6$. The approach underscores the potential of integrating MLMs/LLMs into multimodal SC to enable scalable, personalized, and efficient semantic communications in future wireless systems.
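The correctness criterion mentioned above can be illustrated with a minimal sketch. It assumes that the sent and recovered messages have already been mapped to numeric embedding vectors (the embedding model is not specified here), and that a transmission counts as correct when the cosine similarity is at or above $0.6$. Whether the threshold is inclusive is an assumption, and the function names `cosine_similarity`, `is_correct`, and `transmission_accuracy` are hypothetical, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def is_correct(sent_vec, recv_vec, threshold=0.6):
    """Assumed rule: correct when similarity is at or above the threshold."""
    return cosine_similarity(sent_vec, recv_vec) >= threshold


def transmission_accuracy(pairs, threshold=0.6):
    """Fraction of (sent, received) embedding pairs judged correct."""
    if not pairs:
        return 0.0
    correct = sum(1 for s, r in pairs if is_correct(s, r, threshold))
    return correct / len(pairs)
```

Under these assumptions, accuracy is simply the share of messages whose recovered embedding stays close enough in direction to the original, so the value depends on the chosen embedding as much as on the threshold.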

Abstract

Multimodal signals, including text, audio, image, and video, can be integrated into Semantic Communication (SC) systems to provide an immersive experience with low latency and high quality at the semantic level. However, multimodal SC faces several challenges, including data heterogeneity, semantic ambiguity, and signal distortion during transmission. Recent advancements in large AI models, particularly the Multimodal Language Model (MLM) and the Large Language Model (LLM), offer potential solutions to these issues. To this end, we propose a Large AI Model-based Multimodal SC (LAM-MSC) framework. First, we present the MLM-based Multimodal Alignment (MMA), which uses the MLM to transform between multimodal and unimodal data while preserving semantic consistency. Second, we propose a personalized LLM-based Knowledge Base (LKB), which allows users to perform personalized semantic extraction or recovery through the LLM and thereby addresses semantic ambiguity. Third, we apply Conditional Generative adversarial network-based channel Estimation (CGE) to estimate the wireless channel state information, which mitigates the impact of fading channels in SC. Finally, we conduct simulations that demonstrate the superior performance of the LAM-MSC framework.

Paper Structure (38 sections, 5 figures)

Figures (5)

  • Figure 1: Traditional unimodal SC system versus multimodal SC system.
  • Figure 2: The workflow of the proposed LAM-MSC framework.
  • Figure 3: A dataflow example of the proposed LAM-MSC framework: Sender Mike dispatches an image to receiver Jane with the intention of conveying the semantic content of the image as "Mike and Jane are playing in a garden."
  • Figure 4: Transmission accuracy of multimodal SC under different SNRs.
  • Figure 5: Comparison results of different schemes.