Table of Contents
Fetching ...

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

Eleonora Grassucci, Sergio Barbarossa, Danilo Comminiello

TL;DR

This work tackles task-oriented semantic communications by transmitting compact semantic maps and using a diffusion-based receiver to synthesize semantically faithful images under severe channel distortion. It introduces a robust generative semantic framework where one-hot semantic maps are encoded with strong compression and denoising, then conditioned diffusion generation—with SPADE conditioning—to recover shapes, positions, and depth for downstream tasks. The model is trained with noisy maps and includes a Fast Denoising Semantic block, optimizing a combined denoising and KL-divergence loss that aligns with an ELBO objective. Across Cityscapes and COCO-Stuff datasets, the approach outperforms GAN- and diffusion-based baselines on FID, LPIPS, and segmentation/depth metrics at low PSNR, while requiring far fewer transmitted bits; the authors also provide code at the referenced repository.

Abstract

Semantic communication is expected to be one of the cores of next-generation AI-based communications. One of the possibilities offered by semantic communication is the capability to regenerate, at the destination side, images or videos semantically equivalent to the transmitted ones, without necessarily recovering the transmitted sequence of bits. The current solutions still lack the ability to build complex scenes from the received partial information. Clearly, there is an unmet need to balance the effectiveness of generation methods and the complexity of the transmitted information, possibly taking into account the goal of communication. In this paper, we aim to bridge this gap by proposing a novel generative diffusion-guided framework for semantic communication that leverages the strong abilities of diffusion models in synthesizing multimedia content while preserving semantic features. We reduce bandwidth usage by sending highly-compressed semantic information only. Then, the diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information. We prove, through an in-depth assessment of multiple scenarios, that our method outperforms existing solutions in generating high-quality images with preserved semantic information even in cases where the received content is significantly degraded. More specifically, our results show that objects, locations, and depths are still recognizable even in the presence of extremely noisy conditions of the communication channel. The code is available at https://github.com/ispamm/GESCO.

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

TL;DR

This work tackles task-oriented semantic communications by transmitting compact semantic maps and using a diffusion-based receiver to synthesize semantically faithful images under severe channel distortion. It introduces a robust generative semantic framework where one-hot semantic maps are encoded with strong compression and denoising, then conditioned diffusion generation—with SPADE conditioning—to recover shapes, positions, and depth for downstream tasks. The model is trained with noisy maps and includes a Fast Denoising Semantic block, optimizing a combined denoising and KL-divergence loss that aligns with an ELBO objective. Across Cityscapes and COCO-Stuff datasets, the approach outperforms GAN- and diffusion-based baselines on FID, LPIPS, and segmentation/depth metrics at low PSNR, while requiring far fewer transmitted bits; the authors also provide code at the referenced repository.

Abstract

Semantic communication is expected to be one of the cores of next-generation AI-based communications. One of the possibilities offered by semantic communication is the capability to regenerate, at the destination side, images or videos semantically equivalent to the transmitted ones, without necessarily recovering the transmitted sequence of bits. The current solutions still lack the ability to build complex scenes from the received partial information. Clearly, there is an unmet need to balance the effectiveness of generation methods and the complexity of the transmitted information, possibly taking into account the goal of communication. In this paper, we aim to bridge this gap by proposing a novel generative diffusion-guided framework for semantic communication that leverages the strong abilities of diffusion models in synthesizing multimedia content while preserving semantic features. We reduce bandwidth usage by sending highly-compressed semantic information only. Then, the diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information. We prove, through an in-depth assessment of multiple scenarios, that our method outperforms existing solutions in generating high-quality images with preserved semantic information even in cases where the received content is significantly degraded. More specifically, our results show that objects, locations, and depths are still recognizable even in the presence of extremely noisy conditions of the communication channel. The code is available at https://github.com/ispamm/GESCO.
Paper Structure (16 sections, 15 equations, 11 figures, 9 tables)

This paper contains 16 sections, 15 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Proposed generative semantic communication framework. The sender transmits one-hot, compressed, and normalized encoded maps over the noisy channel. The receiver takes the noisy maps and directly involves them to train the semantic diffusion model. During inference, the receiver applies fast denoising to the semantic information in order to improve image quality.
  • Figure 2: Encoder and decoder blocks of our U-Net-based semantic diffusion model. The SPADE module in the decoder allows the semantic conditioning.
  • Figure 3: Comparison among the methods for transmitted semantics and a fixed PSNR value of $10$.
  • Figure 4: Comparisons among most performing models (CC-FPSE liu2019learning, OASIS schonfeld2021you, and SMIS Zhu2020SemanticallyMI) with $\text{PSNR}=15$. Other methods produce almost noise-only images. Our method produces the best quality samples in which it is easy to recognize objects, cars, and pedestrians, while comparisons generate scenes heavily corrupted by noise.
  • Figure 5: Synthesized images from the transmitted semantics with $\text{PSNR}=10$ for the classical method, SMIS, ControlNet, and our method. The detector can still recognize objects in our generated sample, while other images are too noisy or without preserved semantics such as in ControlNet. The depth estimation confirms the better quality of our generation by correctly estimating distances from objects while producing blurred maps for comparisons.
  • ...and 6 more figures