Vision Transformer Based Semantic Communications for Next Generation Wireless Networks

Muhammad Ahmed Mohsin; Muhammad Jazib; Zeeshan Alam; Muhmmad Farhan Khan; Muhammad Saad; Muhammad Ali Jamshed

TL;DR

This paper addresses efficient semantic communications for next-generation wireless networks by transmitting semantic content rather than exact data. It introduces a Vision Transformer (ViT)–based encoder–decoder for end-to-end semantic image transmission that remains robust under realistic fading and noise. Empirical results show ViT-based semantic transmission achieving higher PSNR and SSIM while reducing bandwidth consumption, outperforming CNN and GAN baselines across multiple datasets and channel models, with PSNR approaching 38 dB. The work demonstrates strong potential for bandwidth-efficient, high-fidelity semantic communication and suggests avenues for future hybrid and edge-deployed implementations.

Abstract

In the evolving landscape of 6G networks, semantic communications are poised to revolutionize data transmission by prioritizing the transmission of semantic meaning over raw data accuracy. This paper presents a Vision Transformer (ViT)-based semantic communication framework designed to achieve high semantic similarity during image transmission while minimizing bandwidth demand. By employing a ViT as the encoder-decoder, the proposed architecture encodes images into representations with high semantic content at the transmitter and accurately reconstructs them at the receiver under real-world fading and noise. Building on the attention mechanisms inherent to ViTs, our model outperforms Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) tailored for generating such images. The proposed ViT-based architecture achieves a Peak Signal-to-Noise Ratio (PSNR) of 38 dB, higher than other Deep Learning (DL) approaches, while maintaining semantic similarity across different communication environments. These findings establish our ViT-based approach as a significant breakthrough in semantic communications.
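To make the evaluation setting concrete, the following is a minimal sketch of the two quantities the abstract refers to: a fading-plus-noise channel and the PSNR metric. It is not the authors' implementation. The function names `psnr` and `rayleigh_block_channel` are hypothetical, and the channel model (a single Rayleigh coefficient per block, additive Gaussian noise, perfect channel knowledge with zero-forcing equalization at the receiver) is an assumption chosen for illustration. The ViT encoder and decoder are omitted, so raw pixel values stand in for the transmitted representation.

```python
import math
import random


def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    # Identical inputs have zero error, so PSNR is unbounded.
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def rayleigh_block_channel(symbols, snr_db, rng):
    """Send complex symbols through one Rayleigh-fading block with Gaussian noise.

    Assumption: the receiver knows the channel coefficient exactly and
    equalizes by dividing it out (zero-forcing).
    """
    power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
    # Noise variance is split evenly between the real and imaginary parts.
    noise_std = math.sqrt(power / 10 ** (snr_db / 10) / 2)
    scale = math.sqrt(0.5)  # gives the fading coefficient unit average power
    h = complex(rng.gauss(0, scale), rng.gauss(0, scale))
    received = [
        h * s + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        for s in symbols
    ]
    return [y / h for y in received]


# Usage: transmit 64 synthetic pixel values at 20 dB SNR and score the result.
rng = random.Random(0)
pixels = [rng.randrange(256) for _ in range(64)]
symbols = [complex(p / 255.0, 0.0) for p in pixels]
received = rayleigh_block_channel(symbols, snr_db=20.0, rng=rng)
restored = [min(255, max(0, round(y.real * 255.0))) for y in received]
print(psnr(pixels, restored))
```

In a semantic communication pipeline of the kind described above, the symbols would instead come from a learned encoder and the receiver would apply a learned decoder. The PSNR computation on the reconstructed image would stay the same.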
