Image Generative Semantic Communication with Multi-Modal Similarity Estimation for Resource-Limited Networks

Eri Hosonuma, Taku Yamazaki, Takumi Miyoshi, Akihito Taya, Yuuki Nishiyama, Kaoru Sezaki

TL;DR

This work addresses image transmission over resource-limited networks by replacing pixel-wise transmission with semantic-information-based transmission. The proposed multi-modal method extracts a descriptive caption, a segmentation map, and a color palette from the original image and sends only these features; the receiver then uses an image-generation model to produce multiple candidate images. The receiver selects the output with two proposed semantic similarity scoring procedures, which favor candidates that match the original's content and composition, while a background recoloring technique enhances object contours. Experiments show substantial data-size reductions relative to JPEG (about 1/270 for the caption alone and about 1/14 for the caption, color palette, and segmentation map combined) and better semantic fidelity than caption-only methods, which suggests practical potential for low-bandwidth scenarios and possible extensions to video and real-time transmission.

Abstract

To reduce network traffic and support environments with limited resources, a method for transmitting images with minimal data is required. Several machine learning-based image compression methods, which reduce the data size of images while maintaining their features, have been proposed. However, in certain situations, reconstructing only the semantic information of an image at the receiver may be sufficient. To realize this concept, semantic-information-based communication, called semantic communication, has been proposed, along with an image transmission method based on it. That method transmits only the semantic information of an image, and the receiver reconstructs the image using an image-generation model. Because it relies on a single type of semantic information, however, reconstructing images that resemble the original is challenging. This study proposes a multi-modal image transmission method that leverages several types of semantic information for efficient semantic communication. The proposed method extracts multi-modal semantic information from an original image and transmits only this information to the receiver. The receiver then generates multiple images using an image-generation model and selects an output image based on semantic similarity. The receiver must make this selection using only the received features, and evaluating similarity under that constraint with conventional metrics is challenging. Therefore, this study explores new metrics for the similarity between semantic features of images and proposes two scoring procedures that evaluate semantic similarity between images based on multiple semantic features. The results indicate that the proposed procedures can compare semantic similarities, such as position and composition, between the semantic features of the original and generated images.
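
The receiver-side selection step described above can be illustrated with a minimal Python sketch. It is not the paper's method: the two proposed scoring procedures are not specified in this summary, so the score below (pixel-label agreement between segmentation maps plus a nearest-color palette similarity, mixed by a weight `w_seg`) is an assumption made purely for illustration. The names `SemanticFeatures`, `segmentation_agreement`, `palette_similarity`, and `select_output` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Color = Tuple[int, int, int]


@dataclass
class SemanticFeatures:
    """Hypothetical container for features the receiver can compare."""
    segmentation: List[List[int]]  # per-pixel class labels
    palette: List[Color]           # representative RGB colors


def segmentation_agreement(a: List[List[int]], b: List[List[int]]) -> float:
    """Fraction of pixels with the same label (maps assumed to share a shape)."""
    total = 0
    same = 0
    for row_a, row_b in zip(a, b):
        for label_a, label_b in zip(row_a, row_b):
            total += 1
            same += label_a == label_b
    return same / total if total else 0.0


def palette_similarity(p: List[Color], q: List[Color]) -> float:
    """1 minus the mean normalized RGB distance from each color in p to its nearest color in q."""
    if not p or not q:
        return 0.0
    max_dist = (3 * 255 ** 2) ** 0.5  # distance between black and white
    nearest = [
        min(sum((x - y) ** 2 for x, y in zip(cp, cq)) ** 0.5 for cq in q)
        for cp in p
    ]
    return 1.0 - sum(nearest) / (len(nearest) * max_dist)


def select_output(received: SemanticFeatures,
                  candidates: List[SemanticFeatures],
                  w_seg: float = 0.5) -> int:
    """Return the index of the candidate whose features best match the received ones.

    Assumed scoring (not the paper's): weighted sum of the two similarities above.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = [
        w_seg * segmentation_agreement(received.segmentation, c.segmentation)
        + (1.0 - w_seg) * palette_similarity(received.palette, c.palette)
        for c in candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the features of each candidate would first have to be re-extracted from the generated image with the same extractors used at the transmitter, so that the comparison needs only the received features and never the original image.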

Paper Structure

This paper contains 20 sections, 3 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Overview of the proposed image transmission method. The transmitter extracts multiple features that capture the semantic information of an image and transmits only these features to reduce the data size. The receiver reconstructs the image from the received features using an image-generation model.
  • Figure 2: Procedure for extracting semantic information from the original image at the transmitter. The transmitter generates a caption, a segmentation array, and a color palette from the original image, representing its descriptive, segment, and color information, respectively.
  • Figure 3: Procedure for reconstructing the image at the receiver using the image-generation model. The receiver generates multiple images by inputting the received caption and generated segmented-colored image into an image-generation model. After the generation, the receiver selects an output image based on the semantic similarity between the received features and those of the generated images.
  • Figure 4: Examples in which color information affects how the image-generation model recognizes segment information. In the upper example, the model correctly recognizes the composition of the target object because the boundary between the object and the background in the segmented-colored image is clear. In the lower example, the positions of the target object in the generated images differ significantly from its position in the original image, possibly because the target object and the background have similar RGB values in the segmented-colored image.
  • Figure 5: Examples of original images, segmented-colored images, and captions used in the experiment.
  • ...and 10 more figures
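
The failure case in Figure 4 (object and background with similar RGB values) suggests a simple contrast check before the segmented-colored image is passed to the image-generation model. The sketch below is an assumption for illustration only: this summary does not describe how the paper's background recoloring works, and the threshold `min_distance`, the Euclidean RGB distance, and the helper names are all hypothetical.

```python
from typing import Tuple

Color = Tuple[int, int, int]


def rgb_distance(a: Color, b: Color) -> float:
    """Euclidean distance between two colors in RGB space."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def farthest_corner(color: Color) -> Color:
    """Corner of the RGB cube farthest from `color` (each channel pushed to 0 or 255)."""
    return tuple(0 if channel >= 128 else 255 for channel in color)


def recolor_background(object_color: Color,
                       background_color: Color,
                       min_distance: float = 100.0) -> Color:
    """Keep the background color if it already contrasts with the object.

    Otherwise return a replacement that is far from the object color.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    if rgb_distance(object_color, background_color) >= min_distance:
        return background_color
    return farthest_corner(object_color)
```

For example, with an object color of `(200, 30, 30)` and a background of `(190, 40, 35)`, the distance is 15, so this sketch would swap the background for `(0, 255, 255)`; a background of `(20, 20, 200)` is already distant enough and would be left unchanged.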