
How to Evaluate Semantic Communications for Images with ViTScore Metric?

Tingting Zhu, Bo Peng, Jifan Liang, Tingchen Han, Hai Wan, Jingqiao Fu, Junjie Chen

TL;DR

This work targets the lack of suitable metrics for evaluating image semantics in semantic communications. It introduces ViTScore, a global semantic similarity metric derived from a pre-trained Vision Transformer, and proves core properties: symmetry, boundedness, and normalization. Through extensive experiments across classical and semantic image transmission systems, including semantic attacks, ViTScore shows robustness and better alignment with semantic judgments than traditional metrics like PSNR, MS-SSIM, and LPIPS. The results suggest ViTScore as a practical and effective tool for assessing semantic fidelity in image SC, with potential extensions to foundation-model-based SC and video semantics.
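
To make the contrast with pixel-level metrics concrete, the sketch below computes PSNR from its standard definition, $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, on a synthetic array. The function name `psnr`, the random test image, and the brightness offset of 100 are illustrative assumptions, not taken from the paper. A uniform brightness shift leaves the content unchanged, yet PSNR drops to about 8.13 dB.

```python
import numpy as np

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    dis = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dis) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Synthetic 8-bit "image" (assumption: random pixels stand in for real data).
rng = np.random.default_rng(0)
image = rng.integers(0, 156, size=(64, 64, 3)).astype(np.uint8)  # values 0..155

# Same content, uniformly brighter; 155 + 100 = 255, so nothing overflows.
brighter = (image.astype(np.int32) + 100).astype(np.uint8)

psnr_shift = psnr(image, brighter)   # MSE is exactly 100**2
psnr_same = psnr(image, image)       # infinite for identical inputs
print(f"PSNR after brightness shift: {psnr_shift:.2f} dB")
```

Because every pixel differs by exactly 100, the mean squared error is 10000 regardless of what the image depicts, which is why PSNR cannot tell a semantic change from a harmless one.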

Abstract

Semantic communications (SC) are expected to be a new paradigm that catalyzes next-generation communication, shifting the main concern from accurate bit transmission to effective semantic information exchange. However, the widely used existing image metrics are not suitable for evaluating image semantic similarity in SC. Classical metrics for measuring the similarity between two images, such as PSNR and MS-SSIM, usually operate at the pixel level or the structural level. Directly using tailored deep-learning-based metrics from the CV community, such as LPIPS, is infeasible for SC. To tackle this, inspired by BERTScore in the NLP community, we propose a novel metric for evaluating image semantic similarity, named Vision Transformer Score (ViTScore). We prove theoretically that ViTScore has 3 important properties, namely symmetry, boundedness, and normalization, which make ViTScore convenient and intuitive for image measurement. To evaluate the performance of ViTScore, we compare it with 3 typical metrics (PSNR, MS-SSIM, and LPIPS) through 4 classes of experiments: (i) correlation with BERTScore through evaluation on an image captioning downstream CV task, (ii) evaluation in classical image communications, (iii) evaluation in image semantic communication systems, and (iv) evaluation in image semantic communication systems under semantic attack. Experimental results demonstrate that ViTScore is robust and efficient in evaluating the semantic similarity of images. In particular, ViTScore outperforms the other 3 typical metrics in evaluating the image semantic changes caused by semantic attacks, such as image inverse with Generative Adversarial Networks (GANs). This indicates that ViTScore is an effective performance metric when deployed in SC scenarios.
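
The abstract says that ViTScore is inspired by BERTScore but does not give its formula here. The following is therefore only a minimal sketch of a BERTScore-style similarity applied to patch features. It assumes greedy cosine matching combined as an F1 score. The function names and the random arrays standing in for ViT patch embeddings (196 patches of dimension 768) are hypothetical, and the exact definition in the paper may differ.

```python
import numpy as np

def l2_normalize(features):
    """Scale each row (one patch feature vector) to unit Euclidean norm."""
    features = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.clip(norms, 1e-12, None)

def bertscore_style_similarity(feats_a, feats_b):
    """Greedy-matching F1 over cosine similarities of two feature sets.

    Assumption: this mirrors BERTScore's recall/precision/F1 construction;
    it is not claimed to be the paper's exact ViTScore definition.
    """
    a = l2_normalize(feats_a)
    b = l2_normalize(feats_b)
    sim = a @ b.T                        # pairwise cosine similarities
    recall = sim.max(axis=1).mean()      # best match in B for each vector of A
    precision = sim.max(axis=0).mean()   # best match in A for each vector of B
    denom = precision + recall
    if abs(denom) < 1e-12:
        return 0.0                       # guard against division by zero
    return float(2.0 * precision * recall / denom)

# Random stand-ins for patch embeddings from a pre-trained ViT (hypothetical).
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(196, 768))
feats_b = feats_a + 0.5 * rng.normal(size=(196, 768))  # perturbed copy of A
feats_c = rng.normal(size=(196, 768))                  # unrelated features

score_ab = bertscore_style_similarity(feats_a, feats_b)
score_ba = bertscore_style_similarity(feats_b, feats_a)
score_aa = bertscore_style_similarity(feats_a, feats_a)
score_ac = bertscore_style_similarity(feats_a, feats_c)
print(score_ab, score_ba, score_aa, score_ac)
```

In this sketch, swapping the arguments exchanges the roles of recall and precision, so the F1 combination is unchanged, and comparing a feature set with itself gives a score of 1.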

Paper Structure (16 sections, 3 theorems, 21 equations, 11 figures, 6 tables)


Key Result

Theorem 1

For any two images $A$ and $B$, changing the order of the two images does not change the $\mathrm{ViTScore}$, i.e., $\mathrm{ViTScore}(A, B) = \mathrm{ViTScore}(B, A)$.

Figures (11)

  • Figure 1: Comparison of image semantic similarity evaluation with 4 metrics (PSNR, MS-SSIM, LPIPS, and ViTScore). An image (a) is transmitted with a GANs-based SC system model over a physical channel assumed to be noiseless, and the reconstruction (b) is evaluated: PSNR = 14.60, MS-SSIM = 0.72, LPIPS = 0.15, ViTScore = 0.79. When semantic noise is introduced, characterized as image inverse with GANs, the reconstruction (c) is evaluated: PSNR = 5.19, MS-SSIM = 0.00, LPIPS = 0.43, ViTScore = 0.60. The fluctuation of the other 3 metrics is much larger than that of ViTScore. In fact, the semantics of images (a) and (b) (resp. (a) and (c)) are very similar, so the semantic similarity between (a) and (b) should be close to that between (a) and (c).
  • Figure 2: The model overview of ViT (Dosovitskiy et al., ICLR 2021). The input image is first split into patches with a fixed size. Then the patches are flattened and projected to the embedding space by a linear layer. The position embedding is then added, in order to keep the positional information of patches. The mixed embedding vectors are fed into an L-layer transformer encoder. Each layer of the transformer encoder is shown on the right side of this figure. The outputs of the transformer encoder are semantic features and can be further fed into an MLP classifier. As we use ViT to extract features, our implementation does not include the MLP head or the classification output.
  • Figure 3: Comparison of the correlations of PSNR, MS-SSIM, LPIPS, and ViTScore with BERTScore over the COCO dataset. The Pearson correlation coefficient between LPIPS and BERTScore is negative because lower LPIPS values indicate higher similarity. The Pearson correlation coefficient between ViTScore and BERTScore is larger than the absolute value of the coefficient between LPIPS and BERTScore, indicating that ViTScore is more strongly correlated with BERTScore than LPIPS.
  • Figure 4: Comparison of evaluation examples with BERTScore, PSNR, MS-SSIM, LPIPS, and ViTScore. For (a) and (b): BERTScore = 0.89, PSNR = 16.23, MS-SSIM = 0.47, LPIPS = 0.56, ViTScore = 0.32. For (c) and (d): BERTScore = 0.93, PSNR = 7.28, MS-SSIM = 0.00, LPIPS = 0.61, ViTScore = 0.66. Intuitively, the semantic similarity between (c) and (d) is higher than that between (a) and (b). Surprisingly, ViTScore performs better than PSNR, MS-SSIM, and LPIPS in these cases: it is consistent with BERTScore and agrees well with human perceptual judgments.
  • Figure 5: Average performance, under each metric, of image transmission using JPEG coding followed by an assumed capacity-achieving channel code over an AWGN channel, evaluated on the COCO image dataset. Average reconstruction quality increases gradually as the channel bandwidth ratio ${\rm CBR}$ increases and as the channel conditions improve. The performance trends of ViTScore are in line with those of the other 3 typical metrics (PSNR, MS-SSIM, and LPIPS).
  • ...and 6 more figures
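
The Figure 3 caption compares Pearson coefficients by absolute value, because a distance-type metric such as LPIPS correlates negatively with a similarity score. The sketch below illustrates that sign convention with `numpy.corrcoef`. All data are synthetic and constructed purely for illustration; the variable names are hypothetical and the numbers say nothing about the actual metrics or the paper's results.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)

# Hypothetical stand-in for a reference similarity score (e.g. a caption score).
reference_score = rng.uniform(0.0, 1.0, size=200)

# A similarity-type metric: higher means more similar (light noise added).
similarity_metric = reference_score + 0.1 * rng.normal(size=200)

# A distance-type metric: lower means more similar (heavier noise added).
distance_metric = 1.0 - reference_score + 0.3 * rng.normal(size=200)

r_similarity = pearson_r(similarity_metric, reference_score)
r_distance = pearson_r(distance_metric, reference_score)

# Compare strength of association by magnitude, ignoring the sign.
stronger = ("similarity_metric"
            if abs(r_similarity) > abs(r_distance)
            else "distance_metric")
print(r_similarity, r_distance, stronger)
```

The sign of the coefficient only records whether the metric is a similarity or a distance; the magnitude is what indicates how strongly it tracks the reference score.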

Theorems & Definitions (7)

  • Definition 1: ViT
  • Definition 2: ViTScore
  • Theorem 1: Symmetry
  • proof
  • Theorem 2: Boundedness
  • Theorem 3: Normalization
  • proof
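
As a sketch of why a symmetry result such as Theorem 1 is plausible, assume a BERTScore-style construction (this form is an assumption for illustration; the definition of ViTScore is not reproduced in this overview). Let $\mathbf{a}_1, \dots, \mathbf{a}_n$ and $\mathbf{b}_1, \dots, \mathbf{b}_m$ be unit-norm patch features of images $A$ and $B$, let $R$ and $P$ denote recall and precision, and let $S$ denote their F1 combination:

$$
\begin{aligned}
R(A,B) &= \frac{1}{n}\sum_{i=1}^{n}\max_{1\le j\le m}\mathbf{a}_i^{\top}\mathbf{b}_j, \qquad
P(A,B) = \frac{1}{m}\sum_{j=1}^{m}\max_{1\le i\le n}\mathbf{a}_i^{\top}\mathbf{b}_j = R(B,A), \\
S(A,B) &= \frac{2\,P(A,B)\,R(A,B)}{P(A,B)+R(A,B)}
        = \frac{2\,R(B,A)\,P(B,A)}{R(B,A)+P(B,A)} = S(B,A).
\end{aligned}
$$

Under this assumed form, swapping the two images only exchanges the roles of recall and precision, and the F1 expression is symmetric in those two quantities.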