Visual Language Model based Cross-modal Semantic Communication Systems
Feibo Jiang, Chuanguo Tang, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan
TL;DR
This work addresses the challenges that Image Semantic Communication (ISC) faces under dynamic wireless conditions by introducing VLM-CSC, a framework that converts images into high-density textual semantics via a BLIP-based cross-modal knowledge base (KB) at the transmitter and reconstructs the images at the receiver with a Stable Diffusion-based KB. It adds a memory-assisted encoder/decoder (MED) to combat catastrophic forgetting and a noise attention module (NAM) that adapts semantic and channel coding to changing SNR. Extensive experiments on diverse datasets show that BLIP-based KBs yield superior semantic fidelity, that MED mitigates forgetting across task distributions, and that NAM improves robustness across SNR levels, all of which contribute to reduced bandwidth use and resilient semantic communication. Overall, VLM-CSC demonstrates effective, adaptable, and bandwidth-efficient cross-modal semantic transmission for dynamic wireless environments.
Abstract
Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, existing Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC system comprises three novel components: (1) A Cross-modal Knowledge Base (CKB) extracts high-density textual semantics from the semantically sparse image at the transmitter and reconstructs the original image from those textual semantics at the receiver. Transmitting high-density semantics helps alleviate bandwidth pressure. (2) A Memory-assisted Encoder and Decoder (MED) employs a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when the distribution of semantic features drifts. (3) A Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the VLM-CSC system. Experimental simulations validate the effectiveness, adaptability, and robustness of the VLM-CSC system.
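The abstract does not specify the internals of NAM, so the following is only a minimal sketch of one common way to condition coding on SNR: a small gating network takes the feature vector together with the SNR value and outputs per-dimension weights in (0, 1) that rescale the features. The class name `SNRGate`, the layer sizes, the random initialisation, and the `snr_db / 20.0` scaling are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v, b):
    # Dense layer: W @ v + b, written with plain lists.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


class SNRGate:
    """Hypothetical SNR-conditioned gate (illustrative, not the paper's NAM)."""

    def __init__(self, dim, hidden=8, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        # Untrained random weights; in practice these would be learned.
        self.W1 = init(hidden, dim + 1)  # +1 input for the SNR value
        self.b1 = [0.0] * hidden
        self.W2 = init(dim, hidden)
        self.b2 = [0.0] * dim

    def weights(self, features, snr_db):
        # Append a roughly normalised SNR (assumed scaling) to the features.
        context = list(features) + [snr_db / 20.0]
        h = [max(0.0, z) for z in matvec(self.W1, context, self.b1)]  # ReLU
        return [sigmoid(z) for z in matvec(self.W2, h, self.b2)]

    def __call__(self, features, snr_db):
        # Rescale each feature dimension by its SNR-dependent weight.
        return [f * w for f, w in zip(features, self.weights(features, snr_db))]


gate = SNRGate(dim=4)
x = [0.5, -1.2, 0.3, 2.0]
low_snr = gate(x, snr_db=0.0)    # features gated for a noisy channel
high_snr = gate(x, snr_db=18.0)  # features gated for a clean channel
```

Because the weights are produced by a sigmoid, the gate can only attenuate a feature, never amplify it or flip its sign. In a trained system the same gate could be placed in both the semantic and the channel coders so that a single model serves a range of SNR values.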
