MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu

TL;DR

MedQ-Bench presents a perception–reasoning benchmark to evaluate medical image quality assessment abilities of multimodal large language models across 5 modalities and 40+ degradations. It decouples perception (Yes-No, What, How) from reasoning (no-reference and paired comparisons) and introduces a four-dimension scoring protocol aligned with radiologist judgments. Large-scale zero-shot evaluations of 14 MLLMs reveal substantial gaps between AI models and human experts, with general-purpose models often outperforming medical-specialized ones and fine-grained degradations proving particularly challenging. The work provides a clinically grounded framework for developing trustworthy MLLMs capable of nuanced IQA reasoning, potentially enabling safer automated quality control in clinical imaging workflows.

Abstract

Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.

Paper Structure

This paper contains 48 sections, 1 equation, 13 figures, 10 tables.

Figures (13)

  • Figure 1: MedQ-Bench overview, evaluating MLLMs’ abilities in medical image quality assessment with: (1) Comprehensive coverage: 3,308 samples across 5 modalities with 40+ degradation types. (2) Multi-faceted evaluation: perception-reasoning paradigm.
  • Figure 2: Comparison of Reasoning IQA with score-based IQA. Unlike purely numerical scores, Reasoning IQA identifies distortion types and their relative impact, yielding results more consistent with human judgment.
  • Figure 3: Examples of question types in MedQ-Bench, covering MCQA perception tasks (Yes-No / What / How), open-ended reasoning, and pair/multi-image comparison.
  • Figure 4: Overall Performance Results
  • Figure 5: Performance analysis of MLLMs across different evaluation dimensions. (a) Performance at different degradation levels. (b) General vs. modality-specific questions.
  • ...and 8 more figures