Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission

Mingyu Yang, Bowen Liu, Boyang Wang, Hun-Seok Kim

TL;DR

DiffJSCC tackles perceptual-quality degradation in wireless image transmission under tight bandwidth and low-SNR conditions by introducing a diffusion-based refinement stage that leverages a pre-trained Stable Diffusion model. It obtains multimodal conditioning from the initial JSCC reconstruction, including spatial features $f_v$, textual features $f_t$, and CSI $(h,\gamma)$, and uses a fine-tuned control module to guide a latent diffusion denoiser toward realistic yet faithful reconstructions. The approach balances fidelity and realism, delivering state-of-the-art perceptual metrics (lower LPIPS and FID) and improved downstream task performance, even achieving highly realistic reconstructions for 768×512 Kodak images with only 3072 symbols ($<0.008$ symbols per pixel) at 1 dB SNR. DiffJSCC is model-agnostic, imposes no transmitter overhead, and can extend to other deep JSCC architectures, offering practical benefits for mobile and IoT scenarios with constrained channels.

Abstract

Deep learning-based joint source-channel coding (deep JSCC) has been demonstrated to be an effective approach for wireless image transmission. Nevertheless, most existing work adopts an autoencoder framework to optimize conventional criteria such as Mean Squared Error (MSE) and Structural Similarity Index (SSIM), which do not suffice to maintain the perceptual quality of reconstructed images. This issue is more prominent under stringent bandwidth constraints or low signal-to-noise ratio (SNR) conditions. To tackle this challenge, we propose DiffJSCC, a novel framework that leverages the prior knowledge of the pre-trained Stable Diffusion model to produce high-realism images via the conditional diffusion denoising process. Our DiffJSCC first extracts multimodal spatial and textual features from the noisy channel symbols in the generation phase. Then, it produces an initial reconstructed image as an intermediate representation to aid robust feature extraction and a stable training process. In the following diffusion step, DiffJSCC uses the derived multimodal features, together with channel state information such as the SNR, as conditions to guide the denoising diffusion process, which converts the initial random noise to the final reconstruction. DiffJSCC employs a novel control module to fine-tune the Stable Diffusion model and adjust it to the multimodal conditions. Extensive experiments on diverse datasets reveal that our method significantly surpasses prior deep JSCC approaches on both perceptual metrics and downstream task performance, showcasing its ability to preserve the semantics of the original transmitted images. Notably, DiffJSCC can achieve highly realistic reconstructions for 768×512 pixel Kodak images with only 3072 symbols (<0.008 symbols per pixel) under 1 dB SNR channels.
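
The bandwidth figure quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the second calculation (dividing by three color components) is an assumption about how the transmission rate $\rho$ in the figure captions might be counted, not something stated in the text here.

```python
from fractions import Fraction


def symbols_per_pixel(num_symbols: int, width: int, height: int) -> Fraction:
    """Channel symbols transmitted per image pixel (hypothetical helper)."""
    return Fraction(num_symbols, width * height)


def symbols_per_component(num_symbols: int, width: int, height: int,
                          channels: int = 3) -> Fraction:
    """Channel symbols per color component, assuming an RGB image.

    Assumption: the rate is counted against width * height * channels values.
    """
    return Fraction(num_symbols, width * height * channels)


# Numbers taken from the abstract: 3072 symbols for a 768x512 Kodak image.
per_pixel = symbols_per_pixel(3072, 768, 512)
per_component = symbols_per_component(3072, 768, 512)

print(per_pixel, float(per_pixel))  # 1/128 0.0078125
print(per_component)                # 1/384
```

The per-pixel value of 1/128 is indeed below 0.008. Under the RGB assumption the same budget works out to 1/384 per component, which coincides with the smaller rate listed in the Figure 5 caption.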

Paper Structure

This paper contains 19 sections, 17 equations, 14 figures, and 1 algorithm.

Figures (14)

  • Figure 1: Overall framework of the proposed DiffJSCC (e) and comparison with other existing deep JSCC structures (a-d).
  • Figure 2: The general framework of the proposed DiffJSCC. In the transmitter, the image $x$ is encoded by a JSCC encoder to yield transmission symbols $y$. In the receiver, a preliminary image reconstruction $\hat{x}$ is generated by a JSCC decoder, which is used to produce multimodal (visual and text) conditions. The final step is the conditional latent diffusion process, which uses these multimodal conditions in the conditional denoiser to guide the image generation procedure.
  • Figure 3: Network structure of the proposed denoiser $\epsilon_{\theta}$ in the conditional latent diffusion model. The pre-trained Stable Diffusion model is shown in blue and the fine-tuned control module is shown in green.
  • Figure 4: Evaluation of the proposed approach versus baselines across different transmission rates at 1 dB (top) and 10 dB (bottom) SNR on the Kodak dataset under the AWGN channel. Lower LPIPS/FID scores indicate higher perceptual quality.
  • Figure 5: Evaluation of the proposed approach versus baselines across different SNR levels with transmission rates $\rho$ of $1/384$ (top) and $1/96$ (bottom) using the Kodak dataset under the AWGN channel. Lower LPIPS/FID scores indicate higher perceptual quality.
  • ...and 9 more figures
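
The captions for Figures 4 and 5 refer to an AWGN channel at a given SNR. As a rough illustration of what that setting means, the sketch below simulates such a channel using only the Python standard library. It is not the authors' implementation: the function names, the use of complex-valued symbols, and the normalization to unit average power before adding noise are all assumptions made for this example.

```python
import math
import random


def awgn_channel(symbols, snr_db, rng=None):
    """Pass complex symbols through an AWGN channel (illustrative sketch).

    Assumption: symbols are scaled to unit average power, so the noise
    variance alone sets the SNR. Returns (normalized_symbols, received).
    """
    rng = rng or random.Random()
    power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
    if power == 0:
        raise ValueError("cannot normalize an all-zero symbol sequence")
    scale = 1.0 / math.sqrt(power)
    normalized = [s * scale for s in symbols]

    # SNR = signal_power / noise_power with signal_power = 1 after scaling.
    noise_power = 10 ** (-snr_db / 10)
    sigma = math.sqrt(noise_power / 2)  # per real/imaginary component
    received = [
        s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
        for s in normalized
    ]
    return normalized, received


def measured_snr_db(sent, received):
    """Empirical SNR in dB from matching sent and received sequences."""
    signal = sum(abs(s) ** 2 for s in sent)
    noise = sum(abs(r - s) ** 2 for s, r in zip(sent, received))
    return 10 * math.log10(signal / noise)


# Example: 20000 random symbols at the 1 dB operating point.
rng = random.Random(0)
tx = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]
sent, received = awgn_channel(tx, snr_db=1.0, rng=rng)
measured = measured_snr_db(sent, received)
print(f"target 1.00 dB, measured {measured:.2f} dB")
```

At 1 dB the noise power is about 79% of the signal power, which gives a sense of how heavily corrupted the received symbols are in the low-SNR experiments.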