Similarity and Quality Metrics for MR Image-To-Image Translation

Melanie Dohmen, Mark A. Klemens, Ivo M. Baltruschat, Tuan Truong, Matthias Lenga

TL;DR

This study addresses the challenge of validating MR image‑to‑image translation by systematically benchmarking a broad suite of metrics. It compares 11 reference and 12 non‑reference metrics across 11 MR‑specific distortions and five normalization strategies, adding a downstream segmentation task to assess task‑relevant fidelity. The results reveal that SSIM and PSNR have notable weaknesses for MR synthesis, while metrics like CW‑SSIM, MSLC, NMI, and texture‑focused measures can better capture distortions; normalization choices substantially influence many scores. The authors provide practical recommendations on metric selection and reporting to improve reliability, comparability, and clinical relevance of MR image translation evaluations.

Abstract

Image-to-image translation can have a large impact in medical imaging, as images can be synthetically transformed to other modalities, sequence types, higher resolutions or lower noise levels. To ensure patient safety, these methods should be validated by human readers, which requires considerable time and cost. Quantitative metrics can effectively complement such studies and provide a reproducible and objective assessment of synthetic images. If a reference is available, the similarity of MR images is frequently evaluated with the SSIM and PSNR metrics, even though these metrics are either insensitive or overly sensitive to specific distortions. When no reference images are available for comparison, non-reference quality metrics can reliably detect specific distortions, such as blurriness. To provide an overview of distortion sensitivity, we quantitatively analyze 11 similarity (reference) and 12 quality (non-reference) metrics for assessing synthetic images. We additionally include a metric on a downstream segmentation task. We investigate the sensitivity to 11 kinds of distortions and typical MR artifacts, and analyze the influence of different normalization methods on each metric and distortion. Finally, we derive recommendations for the effective use of the analyzed similarity and quality metrics in the evaluation of image-to-image translation models.

Paper Structure

This paper contains 43 sections, 54 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Overview of image-to-image translation and types of evaluation metrics. (1) A source image from a source domain is transformed to a prediction in the target domain by an image-to-image translation model. If a reference image is given, this also belongs to the target domain. Then there are multiple possibilities to apply metrics. (A) Reference metrics directly compare prediction and reference image. (B) Non-reference metrics can be applied to the prediction alone, but also - if available - to a reference image. Then both non-reference metric scores can be compared. As an additional option (C), the reference and the prediction can be further processed in a downstream task, i.e. a segmentation task as a second (2) step. The performance of both downstream task results is then assessed with a downstream task metric, i.e. a segmentation metric.
  • Figure 2: Workflow of experiments. (1) 100 reference images were distorted with one of 11 distortions (see the section on distortions) at one of five strengths. (2) The distorted images and the reference images were individually normalized with one of six normalization methods (see the section on normalization methods), including no normalization and omitting piece-wise linear (PL) normalization, which depends on a reference dataset. (A) Reference and (B) non-reference metric scores were obtained from normalized distorted images and normalized reference images. (3) The segmentation model was applied with one normalization method only, because the fully automatic segmentation setup integrated all preprocessing steps, including Zscore normalization.
  • Figure 3: Examples of distorted images from the lowest strength $s=1$ up to the maximal distortion strength $s=5$. For $s=1$, the distortions are hardly visible, so the images all appear the same. All distorted images are displayed with the same intensity range as the reference image, i.e. the range was clipped in case of higher or lower values. The change in the image distorted by ghosting is highlighted by a green arrow. Further examples of distorted images are provided in the Supplementary Figs. S.2-S.7.
  • Figure S.1: Mean category assigned by the six readers for 20 initial parameter settings for each distortion.
  • Figure S.2: Distorted versions of BraTS-GLI-00005-000-t1c, visualized with the intensity range of the reference image.
  • ...and 13 more figures
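
The workflow in Figure 2 (distort a reference image, normalize both images individually, then score with a reference metric) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names are hypothetical, the image is a small synthetic intensity pattern stored as a flat list rather than MR data, Gaussian noise stands in for the paper's distortions, only two of the normalization methods (Minmax and Zscore) plus "none" are shown, and the PSNR data range is taken from the normalized reference image, which is one possible convention among several.

```python
import math
import random
import statistics


def minmax_normalize(pixels):
    """Rescale intensities linearly to the interval [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]


def zscore_normalize(pixels):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    if std == 0:
        return [0.0 for _ in pixels]
    return [(p - mean) / std for p in pixels]


def mse(reference, prediction):
    """Mean squared error between two equally sized images."""
    return sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)


def psnr(reference, prediction, data_range):
    """Peak signal-to-noise ratio in dB for the given intensity range."""
    err = mse(reference, prediction)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / err)


def add_gaussian_noise(pixels, sigma, seed=0):
    """Hypothetical stand-in distortion: additive Gaussian noise."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def score_under_normalizations(reference, distorted):
    """Normalize both images individually, then compute PSNR per method."""
    normalizers = {
        "none": list,
        "minmax": minmax_normalize,
        "zscore": zscore_normalize,
    }
    scores = {}
    for name, normalize in normalizers.items():
        ref_n = normalize(reference)
        dist_n = normalize(distorted)
        # Assumption: the data range is the intensity span of the
        # normalized reference image.
        data_range = max(ref_n) - min(ref_n)
        scores[name] = psnr(ref_n, dist_n, data_range)
    return scores


# Synthetic 16x16 intensity pattern (flattened); not real MR data.
reference = [float((x * 7 + y * 3) % 256) for y in range(16) for x in range(16)]

for sigma in (5.0, 20.0):
    distorted = add_gaussian_noise(reference, sigma)
    scores = score_under_normalizations(reference, distorted)
    print(sigma, {name: round(value, 2) for name, value in scores.items()})
```

Running the loop prints one PSNR value per normalization method for each noise level. The same distorted image generally receives a different score depending on how it was normalized, which is the kind of normalization dependence the study examines across its metrics and distortions.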