Similarity and Quality Metrics for MR Image-To-Image Translation
Melanie Dohmen, Mark A. Klemens, Ivo M. Baltruschat, Tuan Truong, Matthias Lenga
TL;DR
This study addresses the challenge of validating MR image‑to‑image translation by systematically benchmarking a broad suite of metrics. It compares 11 reference and 12 non‑reference metrics across 11 MR‑specific distortions and five normalization strategies, adding a downstream segmentation task to assess task‑relevant fidelity. The results reveal that SSIM and PSNR have notable weaknesses for MR synthesis, while metrics like CW‑SSIM, MSLC, NMI, and texture‑focused measures can better capture distortions; normalization choices substantially influence many scores. The authors provide practical recommendations on metric selection and reporting to improve reliability, comparability, and clinical relevance of MR image translation evaluations.
Abstract
Image-to-image translation can have a large impact in medical imaging, as images can be synthetically transformed to other modalities, sequence types, higher resolutions or lower noise levels. To ensure patient safety, these methods should be validated by human readers, which requires considerable time and cost. Quantitative metrics can effectively complement such studies and provide reproducible and objective assessment of synthetic images. If a reference is available, the similarity of MR images is frequently evaluated by SSIM and PSNR metrics, even though these metrics are either insensitive or overly sensitive to specific distortions. When no reference images are available for comparison, non-reference quality metrics can reliably detect specific distortions, such as blurriness. To provide an overview of distortion sensitivity, we quantitatively analyze 11 similarity (reference) and 12 quality (non-reference) metrics for assessing synthetic images. We additionally include a metric on a downstream segmentation task. We investigate the sensitivity to 11 kinds of distortions and typical MR artifacts, and analyze the influence of different normalization methods on each metric and distortion. Finally, we derive recommendations for the effective use of the analyzed similarity and quality metrics in the evaluation of image-to-image translation models.
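As a minimal illustration of why normalization matters for a reference metric such as PSNR, the following sketch computes PSNR for the same image pair with and without per-image min-max normalization. It is not the implementation used in the study: the helper names `minmax_normalize` and `psnr` are hypothetical, the data is random synthetic noise standing in for an MR slice, and the choice of `data_range` is an assumption made for this example.

```python
import numpy as np

def minmax_normalize(img):
    """Linearly rescale intensities to [0, 1] (assumes a non-constant image)."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)

def psnr(reference, test, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

# Synthetic stand-in for an MR slice (assumption: not real MR data).
rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1000.0, size=(64, 64))
noisy = reference + rng.normal(0.0, 20.0, size=reference.shape)

# PSNR on raw intensities, using the reference's dynamic range.
psnr_raw = psnr(reference, noisy,
                data_range=reference.max() - reference.min())

# PSNR after normalizing each image independently to [0, 1].
psnr_minmax = psnr(minmax_normalize(reference),
                   minmax_normalize(noisy),
                   data_range=1.0)

print(f"PSNR (raw intensities):    {psnr_raw:.2f} dB")
print(f"PSNR (per-image min-max):  {psnr_minmax:.2f} dB")
```

Because the noisy image has a different minimum and maximum than the reference, normalizing each image independently rescales them differently, so the two PSNR values generally disagree even though the underlying image pair is identical. This is one reason to report the normalization method alongside any metric score.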
