
Semantic Similarity Score for Measuring Visual Similarity at Semantic Level

Senran Fan, Zhicheng Bao, Chen Dong, Haotai Liang, Xiaodong Xu, Ping Zhang

TL;DR

This work tackles the challenge of evaluating semantic-level information loss in visual semantic communication, where traditional pixel- or structure-based metrics fall short. It proposes SeSS, a graph-based semantic similarity metric built from SAM segmentation, Scene Graph Generation, and graph matching, anchored by ClipScore and refined through extensive human-annotated tuning. The method is validated on COCO-derived data across compression, noise, generated content, and transformations, demonstrating stronger alignment with human semantic perception than conventional metrics. The resulting SeSS offers a structured, interpretable tool for evaluating semantic-level fidelity in visual communications and guiding future semantic compression and transmission approaches.

Abstract

Semantic communication is widely regarded as a promising new communication paradigm. Unlike traditional symbol-based, error-free communication systems, semantic-based visual communication systems extract, compress, transmit, and reconstruct images at the semantic level. However, widely used image similarity metrics, whether pixel-based (MSE, PSNR) or structure-based (MS-SSIM), struggle to accurately measure the loss of the source's semantic-level information during transmission. This makes it difficult to evaluate visual semantic communication systems, especially when comparing them with traditional communication systems. To address this, we propose a semantic evaluation metric, SeSS (Semantic Similarity Score), based on Scene Graph Generation and graph matching, which converts the similarity between images into a semantic-level graph matching score. In addition, semantic similarity scores for tens of thousands of image pairs are manually annotated and used to fine-tune the hyperparameters of the graph matching algorithm, aligning the metric more closely with human semantic perception. The performance of SeSS is tested on different datasets, including (1) images transmitted by traditional and semantic communication systems at different compression rates, (2) images transmitted by traditional and semantic communication systems at different signal-to-noise ratios, (3) images generated by a large-scale model with different levels of noise introduced, and (4) images subjected to certain special transformations. The experiments demonstrate the effectiveness of SeSS, indicating that the metric can measure differences in the semantic-level information of images and can be used to evaluate visual semantic communication systems.
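
The claim that pixel-based metrics miss semantic-level similarity can be illustrated with a toy example. The sketch below is not taken from the paper: the helper names `mse` and `psnr`, the 8x8 striped test image, and the two distortions are all assumptions chosen for illustration. It shows that a one-pixel circular shift, which leaves the content of the pattern unchanged, yields a far worse PSNR than a uniform dimming of the bright pixels.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

# Hypothetical 8x8 test image: vertical stripes alternating between 0 and 255.
stripes = [[255 if col % 2 else 0 for col in range(8)] for _ in range(8)]

# Same pattern circularly shifted by one pixel: same content, every pixel value flips.
shifted = [row[1:] + row[:1] for row in stripes]

# Same pattern with the bright stripes dimmed by 10 gray levels.
dimmed = [[max(value - 10, 0) for value in row] for row in stripes]

psnr_shifted = psnr(stripes, shifted)  # very low: pixel-wise the images disagree everywhere
psnr_dimmed = psnr(stripes, dimmed)    # much higher, although this image was also altered
```

Both distorted images still depict the same stripes, yet their PSNR values differ by tens of decibels, which is the kind of mismatch a semantic-level metric is meant to avoid.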

Paper Structure

This paper contains 6 sections, 25 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: The existing image similarity metrics can be arranged according to the information level they focus on, with the metrics on the left focusing more on structure and pixel-level differences between images, and the metrics on the right focusing more on semantic-level differences between images. Among these, MSE and PSNR are typical pixel-level metrics, SSIM and MS-SSIM are structure-level metrics, while LPIPS, ViTScore and ClipScore can be considered semantic-level metrics.
  • Figure 2: The architecture of SeSS. Images are segmented into object-level masks by the SAM model. An SGG model then derives object-relation graphs from the images and masks. The matching score between the graphs, computed by the graph matching algorithm, serves as the similarity score between the images.
  • Figure 3: The process of converting images into object-relation graphs based on the SAM and PSG models. The PSG model uses a transformer encoder with self-attention to predict a relation matrix from the masked tokens provided by SAM. Trained with a cross-entropy loss between the ground-truth matrix and the predicted one, the PSG model learns to predict the relationships between the objects masked by SAM.
  • Figure 4: Visual example of calculating the node similarity between $u$ in $G_1$ and $v$ in $G_2$. The node similarity is computed from the similarities of their neighboring nodes and the corresponding relation labels. The problem is transformed into a bipartite graph maximum matching problem in matrix form, which is solved using the KM algorithm to obtain the similarity score.
  • Figure 5: Visual example of human-annotated similarity scores for image pairs. The leftmost image is the original, and the two images to its right are similar to it. Human annotations indicate how similar each of these two images is to the original.
  • ...and 12 more figures
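
The bipartite matching step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example similarity matrix, and the normalisation by the larger neighborhood size are all assumptions. For brevity it finds the maximum-weight matching by exhaustive search rather than by the KM algorithm; on small matrices both reach the same optimum, but KM does so in polynomial time.

```python
from itertools import permutations

def max_weight_matching(sim):
    """Maximum-weight bipartite matching on a rectangular similarity matrix.

    sim[i][j] is the (precomputed) similarity between neighbor i of node u
    and neighbor j of node v. Exhaustive search stands in for the KM algorithm.
    Returns (best_total, pairs) where pairs is a list of (row, col) tuples.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    # Work with rows <= cols so that every row can be assigned a distinct column.
    transposed = n_rows > n_cols
    if transposed:
        sim = [list(col) for col in zip(*sim)]
        n_rows, n_cols = n_cols, n_rows

    best_total, best_cols = float("-inf"), ()
    for cols in permutations(range(n_cols), n_rows):
        total = sum(sim[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols

    # Report pairs in the orientation of the original matrix.
    pairs = [(j, i) if transposed else (i, j) for i, j in enumerate(best_cols)]
    return best_total, pairs

def node_similarity(sim):
    """Matching score normalised by the larger neighborhood (assumed normalisation)."""
    total, _ = max_weight_matching(sim)
    return total / max(len(sim), len(sim[0]))

# Hypothetical similarities: u has 2 neighbors, v has 3 neighbors.
example = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
]
total, pairs = max_weight_matching(example)  # matches neighbor 0 with 0 and 1 with 1
score = node_similarity(example)
```

Dividing by the larger neighborhood means unmatched neighbors lower the score; the paper's actual weighting, including how relation labels enter the matrix, may differ. For larger graphs a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` with `maximize=True` would replace the exhaustive search.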