VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System

Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang

TL;DR

VLF-MSC introduces a unified semantic communication paradigm that transmits a single vision-language feature (VLF) to support both image and text generation at the receiver. Using a vision-language encoder (BLIP-2 Q-Former), the transmitter outputs a compact VLF $\mathbf{y} \in \mathbb{R}^{N\times d}$ that is robust to channel noise and is mapped onto a fixed channel budget. At the receiver, a text decoder (decoder-based LLM) and an image generator (diffusion model) are conditioned on the same VLF, enabling prompt-free, modality-flexible reconstruction and improving semantic fidelity under low SNR. The approach achieves higher semantic accuracy and spectral efficiency than modality-specific baselines, highlighting the practical potential of vision-language foundation models for next-generation semantic communications. Future work includes ablations on query counts, bandwidth-quality trade-offs, higher image resolutions, and scalability to real-time multi-user scenarios.

Abstract

We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
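
To make the channel step concrete, the following is a minimal sketch of passing a feature $\mathbf{y}$ through a noisy channel to obtain $\tilde{\mathbf{y}}$ at a chosen SNR. It is an illustration under stated assumptions, not the authors' implementation: the additive white Gaussian noise (AWGN) model, the unit-power normalization, the shape values `N = 32` and `d = 768`, and the helper names `awgn_channel` and `empirical_snr_db` are all hypothetical choices made for this example. A random vector stands in for the encoder output.

```python
import math
import random


def awgn_channel(y, snr_db, rng):
    """Send a flat list of real feature values through an AWGN channel.

    Returns (received, normalized_input). AWGN is an assumed channel model.
    """
    # Normalize to unit average power so the target SNR is well defined.
    power = sum(v * v for v in y) / len(y)
    scale = 1.0 / math.sqrt(power)
    y_norm = [v * scale for v in y]
    # With unit signal power, the noise variance is 10^(-SNR_dB / 10).
    noise_std = math.sqrt(10.0 ** (-snr_db / 10.0))
    y_tilde = [v + rng.gauss(0.0, noise_std) for v in y_norm]
    return y_tilde, y_norm


def empirical_snr_db(y_norm, y_tilde):
    """Measure the SNR actually realized between sent and received values."""
    signal = sum(v * v for v in y_norm)
    noise = sum((r - v) ** 2 for v, r in zip(y_norm, y_tilde))
    return 10.0 * math.log10(signal / noise)


rng = random.Random(0)
N, d = 32, 768  # hypothetical feature shape, not taken from the paper
y = [rng.gauss(0.0, 1.0) for _ in range(N * d)]  # stand-in for the VLF

y_tilde, y_norm = awgn_channel(y, snr_db=0.0, rng=rng)
snr = empirical_snr_db(y_norm, y_tilde)
print(f"received {len(y_tilde)} values, empirical SNR = {snr:.2f} dB")
```

In a full pipeline the received values would be reshaped to $N \times d$ and used to condition both the text decoder and the image generator; that step is omitted here because it depends on the specific pre-trained models.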

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: Overview of the proposed VLF-MSC framework. The transmitter encodes an input image $\mathbf{x}$ into a compact VLF $\mathbf{y}$ using an image encoder and Q-Former, which is transmitted over a noisy channel. At the receiver, the received VLF $\tilde{\mathbf{y}}$ jointly conditions the text and image semantic decoders to generate both the textual output $\hat{\mathbf{x}}_{\text{text}}$ and the visual output $\hat{\mathbf{x}}_{\text{img}}$, enabling multimodal communication without separate transmissions.
  • Figure 2: Image transmission results. Left: LPIPS (perceptual similarity) and CLIP score (semantic alignment) across SNR levels. Right: qualitative comparison of reconstructed images between VLF-MSC and Img2Img-SC.
  • Figure 3: Text transmission results. Left: BLEU (lexical overlap) and BERT score (semantic similarity) across SNR levels. Right: qualitative comparison of reconstructed texts among VLF-MSC, DeepSC, and separation baseline.
  • Figure 4: Image-text semantic reconstruction under varying SNRs using the proposed VLF-MSC. A single transmitted VLF conditions both modalities, eliminating modality-specific streams and producing semantically aligned outputs across SNRs. Bold words denote preserved key semantics.