
Five Pitfalls When Assessing Synthetic Medical Images with Reference Metrics

Melanie Dohmen, Tuan Truong, Ivo M. Baltruschat, Matthias Lenga

TL;DR

We select five pitfalls in which reference metrics produce unexpected and probably undesired scores, and discuss strategies to avoid them.

Abstract

Reference metrics have been developed to objectively and quantitatively compare two images. They have proven very useful, especially for evaluating the quality of reconstructed or compressed images. Extensive tests of such metrics on benchmarks of artificially distorted natural images have revealed which metrics best correlate with human perception of quality. Directly transferring these metrics to the evaluation of generative models in medical imaging, however, can easily lead to pitfalls, because assumptions about image content, image data format, and image interpretation are often very different. Moreover, the correlation between reference metrics and human perception of quality can vary strongly across kinds of distortion, and commonly used metrics such as SSIM, PSNR, and MAE are not the best choice for all situations. We selected five pitfalls that showcase unexpected and probably undesired reference metric scores and discuss strategies to avoid them.
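To make the dependence on image data format concrete, the following minimal sketch computes MAE and PSNR with NumPy on a synthetic image pair, before and after a linear rescaling of the intensities. The helper names, the random test data, and the assumed 12-bit range of 4095 are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import numpy as np

def mae(ref, img):
    """Mean absolute error between two arrays of equal shape."""
    return float(np.mean(np.abs(ref.astype(np.float64) - img.astype(np.float64))))

def psnr(ref, img, data_range):
    """Peak signal-to-noise ratio in dB for a caller-supplied data range."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

rng = np.random.default_rng(0)
# Hypothetical raw intensities (assumption: roughly 0..1000, arbitrary units).
ref = rng.uniform(0.0, 1000.0, size=(64, 64))
img = ref + rng.normal(0.0, 10.0, size=ref.shape)

# The same pair, linearly rescaled with the reference minimum and maximum.
lo, hi = ref.min(), ref.max()
ref_n = (ref - lo) / (hi - lo)
img_n = (img - lo) / (hi - lo)

mae_raw = mae(ref, img)          # depends on the intensity scale
mae_norm = mae(ref_n, img_n)     # smaller by the factor (hi - lo)

psnr_matched = psnr(ref, img, data_range=hi - lo)   # range matches the data
psnr_norm = psnr(ref_n, img_n, data_range=1.0)      # same value after rescaling
psnr_mismatch = psnr(ref, img, data_range=4095.0)   # assumed 12-bit range

print(f"MAE raw / normalized: {mae_raw:.3f} / {mae_norm:.5f}")
print(f"PSNR matched / normalized / mismatched range: "
      f"{psnr_matched:.2f} / {psnr_norm:.2f} / {psnr_mismatch:.2f} dB")
```

In this toy setting, MAE changes by the scale factor, while PSNR stays the same only if `data_range` is adapted together with the normalization. Passing a nominal range that is larger than the actual intensity range inflates the PSNR, even though the image pair is identical.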

Paper Structure

This paper contains 12 sections, 2 equations, and 5 figures.

Figures (5)

  • Figure 1: An example reference image (a) and its gamma- and linearly transformed version (b) are shown. Mean similarity scores over 100 images are listed in (c). The results reveal a strong influence of normalization parameters and methods.
  • Figure 2: Small misalignments strongly influence reference metrics. Only DISTS and CW-SSIM are less sensitive to small geometric transformations.
  • Figure 3: An example reference image (a), a version cropped by 3% (b), a version cropped to a bounding box (c), and a version masked exactly to the foreground (d) are shown. Mean similarity scores over 100 images are listed in (e). With less identical background included in the calculation, the assessed similarity strongly decreases.
  • Figure 4: An example region of interest with different distortions is shown in the first row: (a) reference, (b) stripes added, (c) Gaussian noise added, (d) lower half of the image replaced by a mirror of the upper half. Mean similarity scores were assessed over 100 images (e). Blurring perceptually improves strong distortions and quantitatively improves most similarity scores, especially SSIM. Out of all observed metrics, NMI best detects blurring.
  • Figure 5: An example of a reference image (a) and a version with replacements (b), as well as their respective tumor segmentations (c, d), are shown. Specifically, the lower half of the reference image is replaced by the mirrored upper half. The mean similarity scores over 100 images are assessed by different metrics (e). While most similarity metrics hardly change with the artificial introduction or removal of a tumor, additional or missing tumor segmentations strongly decrease the DICE score.
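The effects summarized in the captions of Figures 3 and 5 can be illustrated on toy data. The sketch below is a hedged, NumPy-only illustration: it builds a synthetic image with a large identical background, measures MAE over the full image and over the foreground mask only, and then computes a Dice score between binary masks. The array sizes, intensity values, and function names are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def mae(ref, img, mask=None):
    """Mean absolute error, optionally restricted to a boolean mask."""
    diff = np.abs(ref - img)
    return float(diff.mean() if mask is None else diff[mask].mean())

def dice(a, b):
    """Dice overlap of two binary masks; defined as 1.0 if both are empty."""
    a = a.astype(bool)
    b = b.astype(bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)

# Toy reference: 100x100 zero background with a 20x20 foreground of value 1.
ref = np.zeros((100, 100))
foreground = np.zeros((100, 100), dtype=bool)
foreground[40:60, 40:60] = True
ref[foreground] = 1.0

# Toy synthetic image: identical background, foreground off by 0.5.
img = ref.copy()
img[foreground] = 0.5

mae_full = mae(ref, img)             # error diluted by identical background
mae_fg = mae(ref, img, foreground)   # error inside the foreground only

# Toy segmentations: a 10x10 "tumor" that is missing or shifted by 5 pixels.
seg_ref = np.zeros((100, 100), dtype=bool)
seg_ref[45:55, 45:55] = True
seg_missing = np.zeros_like(seg_ref)
seg_shifted = np.roll(seg_ref, 5, axis=1)

dice_missing = dice(seg_ref, seg_missing)
dice_shifted = dice(seg_ref, seg_shifted)

print(f"MAE full image: {mae_full:.3f}, MAE foreground only: {mae_fg:.3f}")
print(f"Dice missing: {dice_missing:.2f}, Dice shifted: {dice_shifted:.2f}")
```

Here the same foreground error yields an MAE of 0.02 over the full image but 0.5 inside the mask, which shows how identical background can dominate a pixel-wise score. The Dice score, in contrast, drops to 0 when the structure is missing from one segmentation, regardless of how much background surrounds it.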