A study of why we need to reassess full reference image quality assessment with medical images

Anna Breger; Ander Biguri; Malena Sabaté Landman; Ian Selby; Nicole Amberg; Elisabeth Brunner; Janek Gröhl; Sepideh Hatamikia; Clemens Karner; Lipeng Ning; Sören Dittmer; Michael Roberts; AIX-COVNET Collaboration; Carola-Bibiane Schönlieb

TL;DR

This paper interrogates the suitability of full-reference IQA measures, notably PSNR, SSIM, and LPIPS, for medical imaging across CT, MRI, X-ray, OCT, digital pathology, and photoacoustic modalities. It demonstrates through modality-specific examples that standard FR-IQA rankings often conflict with clinical and perceptual quality, especially in reconstruction, post-processing, and acquisition settings. The authors argue for task-informed FR-IQA, the development of task-specific metrics, and better reporting and sharing of data and evaluation frameworks to bridge the gap between algorithm development and clinical applicability. The work provides a structured synthesis of failure patterns and proposes guidelines and future research directions to improve reliability and explainability in medical image quality assessment. The overarching message is that FR-IQA should be used with caution in medical imaging, complemented by NR-IQA where appropriate, and guided by explicit task requirements and standardized reporting.

Abstract

Image quality assessment (IQA) is indispensable in clinical practice to ensure high standards, as well as in the development stage of machine learning algorithms that operate on medical images. The popular full reference (FR) IQA measures PSNR and SSIM are known to work well, and have been tested, in many natural imaging tasks, but discrepancies in medical scenarios have been reported in the literature, highlighting the gap between development and actual clinical application. Such inconsistencies are not surprising, as medical images have very different properties from natural images, and PSNR and SSIM have neither been designed for nor properly tested on medical images. Misjudging novel methods in this way may cause unforeseen problems in clinical applications. This paper provides a structured and comprehensive overview of examples where PSNR and SSIM prove to be unsuitable for the assessment of novel algorithms using different kinds of medical images, including real-world MRI, CT, OCT, X-ray, digital pathology and photoacoustic imaging data. Improvement is therefore urgently needed, particularly in this era of AI, to increase reliability and explainability in machine learning for medical imaging and beyond. Lastly, we provide ideas for future research and suggest guidelines for the use of FR-IQA measures applied to medical images.

Paper Structure (38 sections, 3 equations, 12 figures)

Introduction
Outline
Background
Illustration of the failure of common FR-IQA measures on synthetic image degradations
Examples of failure in medical imaging
Computed Tomography
Reconstruction problem
Example 1: Krylov methods in CBCT
Example 2: data driven reconstruction methods in lung CT screening
Example 3: Scanner settings impact in IQ
MRI
Reconstruction Problem
Example 1: Scan acceleration
Example 2: Diffusion-weighted MRI (dMRI)
X-Ray
...and 23 more sections

Figures (12)

Figure 1: Illustrative toy example of problems occurring when using the standard FR-IQA measures PSNR/SSIM/LPIPS for the evaluation of medical images. Degradations have been added artificially to the reference MRI scan (a): contrast enhancement (b), brightness change (c), hole (d), Gaussian white noise (e), JPEG compression (f). PSNR yields the same value for all degradations; SSIM and LPIPS fail to identify the hole (d) and misjudge the quality of (e) and (f).
Figure 2: CBCT reconstructions of phantom head data from different Krylov methods (b)-(h), and PSNR/SSIM/LPIPS values compared to the ground truth (a). The overall visual appearance is misjudged here by all three measures, e.g. PSNR in (g), SSIM in (e) and LPIPS in (g).
Figure 3: Reference image (a) and outputs of different reconstruction methods (b)-(f) applied to dose simulated data. PSNR/SSIM/LPIPS are unable to identify the best reconstruction (c), in which the tumour is also visualized well.
Figure 4: Comparison of image acquisition settings: (a) reference image with the best chosen parameter setting (0.6 mm and 120 kVp); (b) preserves more detail (0.6 mm and 80 kVp) than (c), which is more smoothed (2 mm and 100 kVp). PSNR/SSIM misjudge the visual quality; LPIPS yields reasonable quality scores here.
Figure 5: Reconstruction outputs of accelerated FLAIR MRI data from the algorithms Xpdnet (a)(d) and E2varnet (b)(c)(e)(f). The bottom images (d)-(f) are judged by PSNR/SSIM/LPIPS as better reconstructions than the respective images above them (a)-(c), although they show stronger blur and more ringing artifacts.
...and 7 more figures
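
The Figure 1 observation that PSNR can assign the same value to very different degradations follows from PSNR depending only on the mean squared error. The following is a minimal sketch of that effect, not the authors' experiment: it uses a synthetic random array in place of an MRI scan, and the helper `psnr`, the target MSE of 0.01, and the three degradations (global brightness shift, rescaled Gaussian noise, a local dark patch standing in for a "hole") are assumptions chosen for illustration.

```python
import numpy as np


def psnr(reference, test, data_range=1.0):
    """PSNR in dB, computed from the mean squared error."""
    mse = np.mean((reference - test) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)


rng = np.random.default_rng(0)
# Synthetic stand-in for a scan (assumption: intensities in [0.4, 0.6]).
reference = rng.uniform(0.4, 0.6, size=(64, 64))

target_mse = 0.01  # hypothetical value; every degradation is tuned to it

# 1) Global brightness shift: a constant offset d has MSE d^2.
brightness = reference + np.sqrt(target_mse)

# 2) Gaussian white noise, rescaled so its empirical MSE equals target_mse.
noise = rng.standard_normal(reference.shape)
noise *= np.sqrt(target_mse / np.mean(noise ** 2))
noisy = reference + noise

# 3) Local defect: darken a 16x16 patch by 0.4.
#    MSE = 256 * 0.4^2 / 4096 = 0.01, identical to the other two.
local = reference.copy()
local[24:40, 24:40] -= 0.4

# Values are deliberately not clipped, so each MSE stays exactly on target.
scores = {
    "brightness": psnr(reference, brightness),
    "noise": psnr(reference, noisy),
    "local_defect": psnr(reference, local),
}

for name, value in scores.items():
    print(f"{name:>12}: {value:.4f} dB")
```

All three degraded images score the same PSNR, although a uniform brightness shift, diffuse noise, and a localized defect would have very different consequences for a reader of the image. This is why a single global error measure cannot stand in for a task-informed assessment.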
