Transformer-aided Wireless Image Transmission with Channel Feedback

Haotian Wu, Yulin Shao, Emre Ozfatura, Krystian Mikolajczyk, Deniz Gündüz

TL;DR

This work tackles wireless image transmission with receiver feedback by introducing JSCCformer-f, a unified vision-transformer (ViT) based encoder/decoder pair that leverages blockwise feedback to refine the receiver's belief across multiple iterations. The core idea is to transform the image into a sequence, embed feedback signals, and use a single ViT-based encoder to generate per-block channel symbols, while a symmetric ViT-based decoder reconstructs the image from accumulated channel outputs. Two feedback-embedding schemes (JSCCformer-f and JSCCformer-f lite) balance performance and complexity, and the approach demonstrates state-of-the-art PSNR and perceptual quality (LPIPS) across a wide range of SNRs and bandwidths, with robustness to noisy feedback and adaptation to varying channel conditions. The paper also extends the framework to broadcast channels, showing notable improvements over digital schemes in multi-receiver scenarios and highlighting efficiency, scalability, and low-latency decoding. Overall, JSCCformer-f offers a practical, semantics-aware, feedback-driven JSCC solution for robust, high-quality wireless image transmission in IoT and edge settings, with flexible rate control and broad generalization potential.

Abstract

This paper presents a novel wireless image transmission paradigm that can exploit feedback from the receiver, called DeepJSCC-ViT-f. We consider a block feedback channel model, where the transmitter receives noiseless/noisy channel output feedback after each block. The proposed scheme employs a single encoder to facilitate transmission over multiple blocks, refining the receiver's estimation at each block. Specifically, the unified encoder of DeepJSCC-ViT-f can leverage the semantic information from the source image, and acquire channel state information and the decoder's current belief about the source image from the feedback signal to generate coded symbols at each block. Numerical experiments show that our DeepJSCC-ViT-f scheme achieves state-of-the-art transmission performance with robustness to noise in the feedback link. Additionally, DeepJSCC-ViT-f can adapt to the channel condition directly through feedback without the need for separate channel estimation. We further extend the scope of the DeepJSCC-ViT-f approach to include the broadcast channel, which enables the transmitter to generate broadcast codes in accordance with signal semantics and channel feedback from individual receivers.
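The block feedback channel model described in the abstract can be illustrated with a small simulation. The sketch below is not the learned ViT-based scheme from the paper: it swaps in a hand-written linear rule that retransmits the scaled residual between the source and the receiver's current estimate, purely to show the information flow (transmit a block, receive it over a noisy channel, feed the channel output back, refine). The function names, the real-valued AWGN channel, the unit-power source, and the gain schedule that relies on a known noise level are all assumptions made for illustration.

```python
import random


def awgn(block, noise_std, rng):
    """Pass one block through a real AWGN channel with the given noise std."""
    return [s + rng.gauss(0.0, noise_std) for s in block]


def run_block_feedback(source, m, snr_db, feedback_snr_db=None, seed=0):
    """Toy m-block transmission with channel output feedback.

    Hypothetical stand-in for a learned encoder/decoder pair: each block
    carries the scaled residual between the source and the receiver's
    current estimate, as seen by the transmitter through the feedback link.
    Returns the receiver's mean squared error after each block.
    feedback_snr_db=None models noiseless feedback.
    """
    rng = random.Random(seed)
    sigma = 10 ** (-snr_db / 20)  # forward noise std, assuming unit signal power
    fb_sigma = 0.0 if feedback_snr_db is None else 10 ** (-feedback_snr_db / 20)

    n = len(source)
    rx_estimate = [0.0] * n  # receiver's belief about the source
    tx_view = [0.0] * n      # transmitter's copy of that belief, built from feedback
    mse_per_block = []

    for i in range(m):
        # Assumed gain schedule: keeps the residual at roughly unit power.
        gain = sigma ** (-i)
        x_i = [gain * (s - b) for s, b in zip(source, tx_view)]

        # Forward channel, then the receiver refines its estimate.
        y_i = awgn(x_i, sigma, rng)
        rx_estimate = [b + y / gain for b, y in zip(rx_estimate, y_i)]

        # Feedback channel: the transmitter sees a (possibly noisy) copy of y_i
        # and updates its own copy of the receiver's belief.
        y_fb = awgn(y_i, fb_sigma, rng)
        tx_view = [b + y / gain for b, y in zip(tx_view, y_fb)]

        mse = sum((s - b) ** 2 for s, b in zip(source, rx_estimate)) / n
        mse_per_block.append(mse)

    return mse_per_block


if __name__ == "__main__":
    src_rng = random.Random(1)
    source = [src_rng.gauss(0.0, 1.0) for _ in range(256)]
    print(run_block_feedback(source, m=3, snr_db=10.0))
```

With noiseless feedback, the transmitter's copy matches the receiver's estimate exactly, so each extra block shrinks the error. In this toy, noise in the feedback link leaves a mismatch between the two that the simple rule does not correct.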

Paper Structure

This paper contains 30 sections, 19 equations, 18 figures, 11 tables, and 2 algorithms.

Figures (18)

  • Figure 1: Illustration of alternative JSCC schemes for channels with feedback, where the solid lines represent forward channels, while the dotted lines represent feedback channels. Top: Illustration of the DeepJSCC-f scheme in kurka2020deepjscc (with $m=3$ blocks as an example), where multiple independent encoders and decoders are trained as the channel codes in each block. Bottom: Illustration of our proposed JSCCformer-f scheme, where a single unified encoder-decoder pair is employed at each block, significantly reducing the training and memory requirements as well as the coding complexity.
  • Figure 2: The pipeline of our JSCCformer-f scheme. In the $i$-th transmission block, the ViT-based encoder $E_{\bm{\theta}}$ generates the channel symbols $\bm{X_i}$ from the input image and the channel feedback signals received up to that point, where $\bm{\hat{Y}_{{i-1}}}$ is the noisy channel feedback for the $(i-1)$-th transmission and $\bm{Z_{i-1}}$ is its embedding. Note that $(\bm{\hat{Y}_{{i}}}, \ldots, \bm{\hat{Y}_{{m}}})$ and $(\bm{Z_{i}}, \ldots, \bm{Z_{m-1}})$ are padded with zeros as they correspond to future channel feedback signals.
  • Figure 3: The architecture of the encoder and decoder, where a symmetric structure is designed to encode the input sequence and reconstruct the source signal.
  • Figure 4: Performance of different schemes at various SNR values and bandwidth ratios with noiseless feedback. The models in subfigures (a) and (b) are evaluated over the AWGN channel, and the models in subfigure (c) are evaluated over a Rayleigh fading channel.
  • Figure 5: Performance comparison of different models versus bandwidth ratio $R$ over the AWGN channel at SNR $=10$ dB with noiseless feedback.
  • ...and 13 more figures
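
The zero-padding mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming feedback blocks are plain lists of equal length; the helper name `pad_feedback` and its parameters are hypothetical and do not come from the paper. Padding to a fixed number of slots is presumably what lets one encoder with a fixed input size be reused in every block.

```python
def pad_feedback(received_feedback, num_slots, block_len):
    """Return a fixed-size list of `num_slots` feedback blocks.

    Slots whose feedback has already arrived hold that feedback;
    slots for future blocks are filled with zeros.
    """
    if len(received_feedback) > num_slots:
        raise ValueError("more feedback blocks than available slots")
    for fb in received_feedback:
        if len(fb) != block_len:
            raise ValueError("every feedback block must have block_len entries")

    # Copy the feedback received so far, then zero-fill the remaining slots.
    padded = [list(fb) for fb in received_feedback]
    missing = num_slots - len(received_feedback)
    padded += [[0.0] * block_len for _ in range(missing)]
    return padded


if __name__ == "__main__":
    # After the first block of a three-slot transmission, only one slot is filled.
    print(pad_feedback([[0.3, -1.2]], num_slots=3, block_len=2))
```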