Language-Oriented Semantic Latent Representation for Image Transmission

Giordano Cicchetti, Eleonora Grassucci, Jihong Park, Jinho Choi, Sergio Barbarossa, Danilo Comminiello

TL;DR

This paper tackles the problem that language-only semantic communication (I2T) can miss fine visual details. It proposes a framework that simultaneously transmits a textual caption $y$ and a compact latent image embedding $z$, using a latent diffusion model conditioned on both to reconstruct the image at the receiver. The approach achieves substantial bandwidth savings (payload about 2.09% of the original image) while improving perceptual fidelity over text-only baselines, especially in moderate-to-high SNR scenarios. This work enables adaptive, bandwidth-efficient image transmission and suggests avenues for extending semantic communication to other media and more compact semantic representations.

Abstract

In the new paradigm of semantic communication (SC), the focus is on delivering meanings behind bits by extracting semantic information from raw data. Recent advances in data-to-text models facilitate language-oriented SC, particularly for text-transformed image communication via image-to-text (I2T) encoding and text-to-image (T2I) decoding. However, although semantically aligned, the text is too coarse to precisely capture sophisticated visual features such as spatial locations, color, and texture, incurring a significant perceptual difference between intended and reconstructed images. To address this limitation, in this paper, we propose a novel language-oriented SC framework that communicates both text and a compressed image embedding and combines them using a latent diffusion model to reconstruct the intended image. Experimental results validate the potential of our approach, which transmits only 2.09% of the original image size while achieving higher perceptual similarities in noisy communication channels compared to a baseline SC method that communicates only through text. The code is available at https://github.com/ispamm/Img2Img-SC/ .

Paper Structure (8 sections, 5 equations, 5 figures, 1 table)

This paper contains 8 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Random samples from the Flickr-8k dataset (Hodosh, Young, and Hockenmaier, 2013). Original images are on the left; on the right are images regenerated with our framework under different conditioning signals. Images regenerated from both the textual description and the image embedding are not only semantically aligned with the originals but also perceptually very similar to them.
  • Figure 2: Overview of the proposed framework. At the sender, an image-to-text (I2T) model produces the textual caption, while an image encoder network captures the scheme and intrinsic semantics of the image in a latent representation. Both the text and the image embedding are transmitted over the noisy channel. At the receiver, a latent diffusion model regenerates the content: it takes the noisy image embedding and applies $T$ diffusion steps conditioned on the textual caption. Finally, the decoder network maps the result back to the original image dimensionality.
  • Figure 3: Comparison between the two proposed approaches in a noisy channel scenario. The metrics considered are LPIPS, SSIM, CLIP Score, and FID. Lower is better for LPIPS and FID; higher is better for SSIM and CLIP Score.
  • Figure 4: Visual results. On the left, three randomly selected samples with text captions automatically extracted by the BLIP-large model (Li et al., 2022). On the right, the reconstructed images. The number under each image is the LPIPS score between the regenerated image and the intended one. The SNR is set to 7.5 dB.
  • Figure 5: Visual results for different sampling timesteps. On the left, three randomly selected samples. On the right, images reconstructed using different strategies. The timesteps used are $T \in \{30, 10, 0\}$. The SNR is set to 7.5 dB.
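
The noisy-channel results in Figures 3 to 5 are reported at a given SNR in dB (7.5 dB in Figures 4 and 5). As a rough illustration of what that setting means for a transmitted embedding, the sketch below corrupts a stand-in latent vector with additive white Gaussian noise at a target SNR. The AWGN channel model, the helper names `awgn_channel` and `empirical_snr_db`, and the 4096-element random vector are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import math
import random


def awgn_channel(signal, snr_db, rng=None):
    """Add white Gaussian noise to a flat list of floats at a target SNR (dB).

    Hypothetical helper for illustration; the channel model is an assumption.
    """
    rng = rng or random.Random()
    # Average signal power (mean of squares).
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise), so P_noise = P_signal / 10^(SNR_dB / 10).
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def empirical_snr_db(clean, noisy):
    """Measure the SNR (dB) actually realized between a clean and a noisy signal."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)
# Stand-in for a flattened latent image embedding z (size chosen arbitrarily).
z = [rng.gauss(0.0, 1.0) for _ in range(4096)]
z_noisy = awgn_channel(z, snr_db=7.5, rng=rng)
measured = empirical_snr_db(z, z_noisy)
print(f"target SNR: 7.5 dB, measured SNR: {measured:.2f} dB")
```

The measured SNR should land close to the target, with a small deviation because the noise power is estimated from a finite number of samples. In the framework described above, the receiver would pass such a noisy embedding to the latent diffusion model together with the caption.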