A study on the adequacy of common IQA measures for medical images

Anna Breger, Clemens Karner, Ian Selby, Janek Gröhl, Sören Dittmer, Edward Lilley, Judith Babar, Jake Beckford, Thomas R Else, Timothy J Sadler, Shahab Shahipasand, Arthikkaa Thavakumar, Michael Roberts, Carola-Bibiane Schönlieb

TL;DR

This study asks whether common IQA measures, developed for natural images, adequately reflect expert judgments of medical images. It evaluates full-reference (FR) and no-reference (NR) measures on grayscale LIVE data, accelerated MRI, photoacoustic reconstructions, and post-processed chest X-rays, using the Spearman and Kendall rank correlation coefficients (SRCC and KRCC) to quantify agreement with human ratings. PSNR and SSIM perform poorly on medical data, HaarPSI consistently ranks among the top measures, and the FR measures FSIM, MS-SSIM, LPIPS, and IW-SSIM also perform robustly. Correlations on natural images are considerably higher than on medical images. The work highlights the need for task-tailored IQA measures and broader data sharing to enable accurate quality assessment in medical imaging.

Abstract

Image quality assessment (IQA) is standard practice in the development stage of novel machine learning algorithms that operate on images. The most commonly used IQA measures have been developed and tested for natural images, but not in the medical setting. Reported inconsistencies arising in medical images are not surprising, as they have different properties than natural images. In this study, we test the applicability of common IQA measures for medical image data by comparing their assessment to manually rated chest X-ray (5 experts) and photoacoustic image data (2 experts). Moreover, we include supplementary studies on grayscale natural images and accelerated brain MRI data. The results of all experiments show a similar outcome in line with previous findings for medical images: PSNR and SSIM in the default setting are in the lower range of the result list and HaarPSI outperforms the other tested measures in the overall performance. Also among the top performers in our experiments are the full reference measures FSIM, LPIPS and MS-SSIM. Generally, the results on natural images yield considerably higher correlations, suggesting that additional employment of tailored IQA measures for medical imaging algorithms is needed.
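
The TL;DR names SRCC and KRCC as the yardsticks for agreement between an IQA measure and human ratings. The following is a minimal sketch of how these two rank correlations can be computed, not the authors' evaluation code. The function names and the example numbers are hypothetical and chosen only for illustration; KRCC is implemented here as Kendall's tau-b, which is an assumption about the variant. In practice, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` provide the same quantities.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def krcc(x, y):
    """Kendall rank correlation (tau-b), counting pairs explicitly."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )


# Hypothetical data: expert ratings (higher is better) and the scores
# that some IQA measure assigned to the same five images.
expert_ratings = [1, 2, 3, 4, 5]
measure_scores = [0.20, 0.40, 0.35, 0.80, 0.90]

print(f"SRCC = {srcc(expert_ratings, measure_scores):.2f}")
print(f"KRCC = {krcc(expert_ratings, measure_scores):.2f}")
```

Both coefficients depend only on the ordering of the scores, so measures with different value ranges can be compared on the same footing. For measures where lower values mean better quality, the sign of the correlation flips and should be interpreted accordingly.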

Paper Structure (9 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: The speedyIQA annotation app allows setting a task and rating categories for manual image quality ratings.
  • Figure 2: Reconstructed MRI brain data from the fastMRI data set, obtained with E2E-VarNet on the sub-sampled data with acceleration factors $1, 4, 8, 12$ and $16$. The left image is the reference image, obtained via the root sum of squares (rSOS) of the fully sampled data. The visual quality decreases as the acceleration factor increases.
  • Figure 3: Two examples of the photoacoustic images used: references (a) and reconstructions from three algorithms (b)-(d). Algorithm 1 corrects a reconstructed PA image by using the light fluence obtained from simulations. Algorithms 2 and 3 are deep-learning models trained to estimate the absorption coefficient.
  • Figure 4: Chest X-ray scans with different kinds of post-processing. The image on the left serves as the reference, the other images show lower visual quality.
  • Figure 5: IQA comparison of decreasing MRI reconstruction quality as the acceleration factor increases (1 to 16). All tested FR measures correctly identify the decrease in quality, whereas two of the tested NR measures (NIQE and PAQ-2-PIQ) struggle to identify the quality loss accurately. The SRCC/KRCC values between the measures and the acceleration categories show corresponding behavior.
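
The Figure 2 caption mentions a reference image obtained via the rSOS of the fully sampled data. As a rough illustration of that combination step, the sketch below computes the pixel-wise root sum of squares over complex per-coil images. The array layout (coils on the first axis), the function name, and the random test data are assumptions made for this example; they are not taken from the paper or from the fastMRI code.

```python
import numpy as np


def root_sum_of_squares(coil_images, coil_axis=0):
    """Combine per-coil complex images into one magnitude image.

    Each output pixel is sqrt(sum_c |x_c|^2), with the sum taken over coils.
    """
    coil_images = np.asarray(coil_images)
    return np.sqrt(np.sum(np.abs(coil_images) ** 2, axis=coil_axis))


# Hypothetical input: 4 coils, each an 8x8 complex-valued image.
rng = np.random.default_rng(0)
coils = rng.standard_normal((4, 8, 8)) + 1j * rng.standard_normal((4, 8, 8))

reference = root_sum_of_squares(coils)
print(reference.shape, reference.dtype)
```

The result is a real, non-negative image with the coil dimension removed, which can then serve as the reference input for a full-reference measure.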