SAMScore: A Content Structural Similarity Metric for Image Translation Evaluation

Yunxiang Li, Meixu Chen, Kai Wang, Jun Ma, Alan C. Bovik, You Zhang

TL;DR

SAMScore is a general-purpose content structural similarity metric for image translation. It passes the source image and the translated image through the Segment Anything Model (SAM) encoder to obtain embeddings of their content structures, computes the cosine similarity between the two embeddings at each spatial location, and averages the result into a single score. Across 19 image translation tasks, SAMScore outperforms FCNScore and ViTScore at capturing content faithfulness, including when the generated images are subjected to geometric distortions and noise. The metric is intended as a practical tool for structure-focused evaluation of translation models in domains such as art and medical imaging, and it could potentially also be used to guide model optimization.
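The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the two SAM encoder outputs are already available as arrays of shape (channels, height, width). The function name `spatial_cosine_score`, the `eps` guard, the array shape, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def spatial_cosine_score(src_emb, gen_emb, eps=1e-8):
    """Return (mean similarity, per-location similarity map).

    Both inputs are assumed to be (C, H, W) embeddings; the cosine
    similarity is taken along the channel axis at every spatial location.
    """
    src_emb = np.asarray(src_emb, dtype=np.float64)
    gen_emb = np.asarray(gen_emb, dtype=np.float64)
    if src_emb.ndim != 3 or src_emb.shape != gen_emb.shape:
        raise ValueError("expected two embeddings with the same (C, H, W) shape")

    # Channel-wise dot product and norms, one value per spatial location.
    dot = (src_emb * gen_emb).sum(axis=0)
    norms = np.linalg.norm(src_emb, axis=0) * np.linalg.norm(gen_emb, axis=0)

    # eps (an assumption of this sketch) avoids division by zero.
    sim_map = dot / np.maximum(norms, eps)
    return float(sim_map.mean()), sim_map


# Random arrays stand in for real SAM encoder outputs (placeholder shape).
rng = np.random.default_rng(0)
src = rng.standard_normal((8, 4, 4))

score_same, sim_map = spatial_cosine_score(src, src)       # identical structure
score_scaled, _ = spatial_cosine_score(src, 3.0 * src)     # rescaled features
score_flipped, _ = spatial_cosine_score(src, -src)         # opposite features
```

Because cosine similarity ignores vector magnitude, the rescaled embedding scores the same as the identical one, while the negated embedding scores at the opposite extreme. The per-location map is returned alongside the mean so that it can be inspected spatially.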

Abstract

Image translation has wide applications, such as style transfer and modality conversion, usually aiming to generate images having both high degrees of realism and faithfulness. These problems remain difficult, especially when it is important to preserve content structures. Traditional image-level similarity metrics are of limited use, since the content structures of an image are high-level, and not strongly governed by pixel-wise faithfulness to an original image. To fill this gap, we introduce SAMScore, a generic content structural similarity metric for evaluating the faithfulness of image translation models. SAMScore is based on the recent high-performance Segment Anything Model (SAM), which allows content similarity comparisons with standout accuracy. We applied SAMScore on 19 image translation tasks, and found that it is able to outperform all other competitive metrics on all tasks. We envision that SAMScore will prove to be a valuable tool that will help to drive the vibrant field of image translation, by allowing for more precise evaluations of new and evolving translation models. The code is available at https://github.com/Kent0n-Li/SAMScore.

Paper Structure (18 sections, 3 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Overview of SAMScore. The source and the generated images are separately input to the SAM encoder to obtain embeddings of the content structures. A spatial cosine similarity score is then calculated, yielding the final SAMScore.
  • Figure 2: Similarity scores between the original images and CycleGAN-generated images with varying degrees of deformation. (A) horse to zebra, (B) orange to apple, (C) cityscapes (label to photo), (D) head (MR to CT), (E) photo to Ukiyoe, and (F) photo to Monet. Column 2 shows the results without additional distortion. Given the differing scales of the metrics, we converted each metric to a percentage change relative to the initial value (results without any perturbation) before generating the line plots. We used dashed lines to represent trends that deviate from the ideal direction of change.
  • Figure 3: Similarity scores between the original images and CycleGAN-generated images with varying degrees of Gaussian noise corruption. (A) horse to zebra, (B) orange to apple, (C) cityscapes (label to photo), (D) head (MR to CT), (E) photo to Ukiyoe, and (F) photo to Monet. Column 2 shows results without additional distortion. Given the differing scales of the metrics, we converted each metric to a percentage change relative to the initial value (results without any perturbation) before generating the line plots.
  • Figure 4: FCNScore and SAMScore: Piecewise affine deformations and Gaussian noise were applied to the generated images, then segmentation was performed by a DeepLabV3Plus model that had been trained on real target images. The content shape structural similarity of the generated images was then evaluated by applying accuracy and IoU metrics. Column 2 shows results without added distortion. Given the differing scales of the metrics, we converted each metric to a percentage change relative to the initial value (results without any perturbation) before generating the line plots. We used dashed lines to represent trends that deviate from the ideal direction of change.
  • Figure 5: FCNScore and SAMScore: Piecewise affine deformations were applied to the generated images of the cityscapes (label to photo) task, and then segmentation was performed by an FCN that had been trained on 'true' target images. The content shape structural similarity of the generated images was then evaluated by applying accuracy and IoU metrics. Column 2 shows results without added distortion. Row 2 shows the segmentation results of the FCN, and row 3 shows the cosine similarity matrix of SAMScore.
  • ...and 4 more figures
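
The captions above mention two small numeric steps: rescaling each metric to a percentage change relative to its unperturbed value, and scoring a segmentation with accuracy and IoU. The sketch below illustrates both under stated assumptions. The function names are hypothetical, averaging IoU over classes is an assumption (the captions only say "IoU"), and the input numbers are made up for illustration rather than taken from the paper.

```python
import numpy as np


def percent_change_from_baseline(values):
    """Express each value as a percentage change relative to the first one."""
    baseline = float(values[0])
    if baseline == 0.0:
        raise ValueError("baseline must be non-zero")
    # Dividing by abs(baseline) keeps "up" positive and "down" negative even
    # for a negative baseline; this is a choice made for this sketch.
    return [100.0 * (float(v) - baseline) / abs(baseline) for v in values]


def pixel_accuracy_and_mean_iou(pred, target, num_classes):
    """Pixel accuracy and class-averaged IoU for two integer label maps."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    if pred.shape != target.shape:
        raise ValueError("label maps must have the same shape")

    accuracy = float((pred == target).mean())

    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    if not ious:
        raise ValueError("no class in range(num_classes) appears in either map")
    return accuracy, float(np.mean(ious))


# Made-up metric values: unperturbed first, then increasing distortion.
curve = percent_change_from_baseline([0.5, 0.375, 0.25])

# Made-up 2x2 label maps with two classes.
acc, miou = pixel_accuracy_and_mean_iou([[0, 0], [1, 1]], [[0, 1], [1, 1]], num_classes=2)
```

Rescaling to a percentage change puts metrics with different ranges on one axis, which is the reason the captions give for the conversion before plotting.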