
Semantic Feature Decomposition based Semantic Communication System of Images with Large-scale Visual Generation Models

Senran Fan, Zhicheng Bao, Chen Dong, Haotai Liang, Xiaodong Xu, Ping Zhang

TL;DR

We propose a novel paradigm based on Semantic Feature Decomposition (SeFD) that integrates semantic communication with large-scale visual generation models to achieve high-performance, highly interpretable, and controllable image communication.

Abstract

End-to-end image communication systems have been widely studied in the academic community. Escalating demands on image communication systems in terms of data volume, environmental complexity, and task precision call for greater communication efficiency, noise robustness, and semantic fidelity. We therefore propose a novel paradigm based on Semantic Feature Decomposition (SeFD) that integrates semantic communication with large-scale visual generation models to achieve high-performance, highly interpretable, and controllable image communication. Following this paradigm, we propose a Texture-Color based Semantic Communication system of Images (TCSCI). At the transmitter, TCSCI decomposes each image into its natural language description (text), texture semantic features, and color semantic features. These features are transmitted over the wireless channel, and at the receiver a large-scale visual generation model restores the image from the received features. TCSCI can achieve extremely compressed, highly noise-resistant image semantic communication with high visual similarity, while keeping the transmission process interpretable and editable. Experiments demonstrate that TCSCI outperforms traditional image communication systems and existing semantic communication systems under extreme compression, with good anti-noise performance and interpretability.
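The abstract describes features being sent over a noisy wireless channel. Such channels are commonly simulated as additive white Gaussian noise (AWGN) at a chosen signal-to-noise ratio (SNR). The Python sketch below assumes a real-valued AWGN channel; the paper's exact channel model is not given here, and the names `power_normalize`, `awgn_channel`, and `mse` are hypothetical helpers, not taken from the paper.

```python
import math
import random

def power_normalize(symbols):
    """Scale a real-valued symbol vector to unit average power."""
    if not symbols:
        return []
    power = sum(s * s for s in symbols) / len(symbols)
    if power == 0:
        return list(symbols)  # all-zero input: nothing to scale
    scale = 1.0 / math.sqrt(power)
    return [s * scale for s in symbols]

def awgn_channel(symbols, snr_db, rng=None):
    """Hypothetical real-valued AWGN channel (an assumed model, not the paper's).

    The noise variance is chosen so that the measured average signal power
    divided by the noise power equals snr_db, expressed in decibels.
    """
    if not symbols:
        return []
    rng = rng or random.Random()
    power = sum(s * s for s in symbols) / len(symbols)
    noise_var = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_var)
    return [s + rng.gauss(0.0, sigma) for s in symbols]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for a compressed semantic latent code (random values, not real features).
    latent = power_normalize([rng.uniform(-1, 1) for _ in range(10_000)])
    for snr in (0, 10, 20):
        received = awgn_channel(latent, snr, rng)
        print(f"SNR {snr:>2} dB -> MSE {mse(latent, received):.4f}")
```

With unit-power input, each 10 dB increase in SNR should reduce the measured MSE by roughly a factor of ten; this is the kind of degradation that the channel coding described for the transmission module is meant to counteract.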

Paper Structure

This paper contains 19 sections, 12 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: The architecture of the proposed paradigm and of the end-to-end model-driven image semantic communication paradigm. In the end-to-end model-driven paradigm, semantic feature extraction and image reconstruction are performed directly by an autoencoder neural network. The extracted semantic features result from black-box neural network computation, so the entire process lacks interpretability, editability, and the ability to generalize across tasks. In the proposed paradigm, by contrast, the image is decomposed into distinct, meaningful semantic features, which are encoded, transmitted, and decoded separately; at the receiver, these features control a large-scale visual generation model so that it generates an image with the specified semantic features.
  • Figure 2: The overall system architecture of TCSCI. Following the communication workflow, the architecture can be divided into three parts: A. Semantic Feature Extraction Module, B. Semantic Feature Transmission Module, and C. Image Restoration Module. In module A, the image is decomposed into natural language descriptions, color semantic features, and texture semantic features using the BLIP model, downsampling, and the LBP algorithm, respectively. In module B, these features are further compressed, transmitted through the wireless channel, and restored at the receiver. In module C, the restored features control ControlNet, which drives Stable Diffusion to generate an image under specific semantic control. Overlaying the color and texture features then yields images with high visual similarity, completing the transmission process.
  • Figure 3: The BLIP captioning process. Natural language descriptions are obtained by feeding the image into the BLIP model. The Image Encoder divides the input image into patches and encodes them into a sequence of image embeddings, which are then fed into the Image-grounded Text Decoder to generate the corresponding descriptions.
  • Figure 4: Example of the calculation process of the LBP algorithm. For each pixel in the target image, the grayscale values of all pixels in its neighborhood (a 3×3 neighborhood in this example) are extracted. As shown in the matrix in the figure, each neighboring value is compared with the grayscale value of the center pixel and assigned 1 if it is greater or 0 if it is smaller. Arranging these bits in a fixed order yields a binary number, whose decimal value is the grayscale value of the LBP texture feature map at that pixel.
  • Figure 5: The structure of the Semantic Transmission Module. The module is inspired by the model-driven image semantic communication system LSCI and consists of two parts: the base model and the channel component-model. In the base model, convolutional neural networks further compress the semantic feature maps into a semantic latent encoding, which the semantic decoder then reconstructs. The channel component-model introduces joint source-channel coding to reduce the impact of noise on the semantic feature maps during wireless transmission, thereby improving the system's anti-noise ability.
  • ...and 8 more figures
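The Figure 2 caption states that color semantic features are obtained by downsampling. The exact downsampling operator and factor are not given here, so the sketch below assumes simple block averaging (average pooling) over an RGB image stored as nested lists of `(r, g, b)` tuples, plus nearest-neighbour upsampling to view the coarse color map at full resolution. Both function names are hypothetical.

```python
def downsample_mean(img, factor):
    """Average-pool an H x W image of (r, g, b) tuples by an integer factor.

    Assumed stand-in for the paper's color-feature downsampling step.
    """
    h, w = len(img), len(img[0])
    if h % factor or w % factor:
        raise ValueError("image height and width must be divisible by factor")
    channels = len(img[0][0])
    n = factor * factor
    out = []
    for bi in range(0, h, factor):
        row = []
        for bj in range(0, w, factor):
            acc = [0.0] * channels
            # Sum every pixel inside the factor x factor block, per channel.
            for i in range(bi, bi + factor):
                for j in range(bj, bj + factor):
                    for k in range(channels):
                        acc[k] += img[i][j][k]
            row.append(tuple(a / n for a in acc))
        out.append(row)
    return out

def upsample_nearest(small, factor):
    """Repeat each pixel factor x factor times (nearest-neighbour upsampling)."""
    out = []
    for row in small:
        expanded = [px for px in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(expanded))
    return out

if __name__ == "__main__":
    red, blue = (255, 0, 0), (0, 0, 255)
    image = [[red, red, blue, blue] for _ in range(4)]
    color_map = downsample_mean(image, 2)
    print(color_map)  # a 2 x 2 map: red on the left, blue on the right
```

With a factor of f, an H×W×3 image shrinks to (H/f)×(W/f)×3 values, an f² reduction, while keeping the coarse color layout that the receiver can overlay on the generated image.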
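The Figure 4 caption describes the basic 3×3 LBP operator. A minimal pure-Python sketch follows. It makes three assumptions that the caption leaves open: neighbours equal to the center are assigned 1 (the common LBP convention), neighbours are read clockwise starting from the top-left with the first bit as the most significant, and border pixels use edge replication so the texture map has the same size as the input. The names `OFFSETS`, `lbp_code`, and `lbp_map` are hypothetical.

```python
# Neighbour offsets (row, col), clockwise from the top-left pixel (assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(gray, i, j):
    """Return the 8-bit LBP code of pixel (i, j) in a 2-D grayscale list."""
    h, w = len(gray), len(gray[0])
    center = gray[i][j]
    code = 0
    for di, dj in OFFSETS:
        # Clamp to the image bounds (edge replication) so border pixels get a code.
        ni = min(max(i + di, 0), h - 1)
        nj = min(max(j + dj, 0), w - 1)
        bit = 1 if gray[ni][nj] >= center else 0  # ties count as 1 (assumption)
        code = (code << 1) | bit  # first neighbour ends up as the most significant bit
    return code

def lbp_map(gray):
    """Compute the LBP texture map, same size as the input image."""
    return [[lbp_code(gray, i, j) for j in range(len(gray[0]))]
            for i in range(len(gray))]

if __name__ == "__main__":
    patch = [[6, 5, 2],
             [7, 6, 1],
             [9, 8, 7]]
    # Clockwise bits for the center: 1 0 0 0 1 1 1 1 -> 0b10001111 = 143
    print(lbp_code(patch, 1, 1))  # 143
```

Only the ordering of neighbours changes which decimal value a given pattern maps to; any fixed order works as long as the transmitter and receiver agree on it.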