A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Tianhe Wu; Kede Ma; Jie Liang; Yujiu Yang; Lei Zhang

TL;DR

This work systematically investigates prompting Multimodal Large Language Models (MLLMs) for Image Quality Assessment (IQA) by pairing psychophysics-inspired testing procedures with NLP prompting strategies, and introduces a difficult-sample selection procedure that uses expert IQA models as proxies. It evaluates three open-source MLLMs and one closed-source MLLM (GPT-4V) on full-reference and no-reference IQA tasks across multiple visual attributes. GPT-4V generally aligns best with human perception, but it struggles with fine-grained color differences and multi-image comparisons. Open-source MLLMs need model-specific prompting to reach their best IQA performance, and chain-of-thought prompting offers notable gains for GPT-4V, suggesting that IQA benefits from integrating perceptual analysis into broader reasoning tasks. The study highlights the need for prompt optimization and for cautious interpretation of open-source model capabilities, and it proposes a practical sampling framework for efficiently stress-testing MLLMs on IQA benchmarks. These findings inform future work on using MLLMs for interpretable, text-driven IQA and model evaluation.

Abstract

While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. We first investigate nine prompting systems for MLLMs, formed by combining three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) with three popular prompting strategies in natural language processing (i.e., standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with their respective optimal prompting systems. We assess three open-source MLLMs and one closed-source MLLM on several visual attributes of image quality (e.g., structural and textural distortions, geometric transformations, and color differences) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account of human perception of image quality, but it is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing the visual quality of multiple images, tasks humans can perform effortlessly.
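As a minimal illustration of how the nine prompting systems arise, the sketch below enumerates the Cartesian product of the three psychophysical procedures and the three prompting strategies named in the abstract. The constant names, the function name `enumerate_prompting_systems`, and the dictionary layout are assumptions made for illustration only; they are not taken from the paper or its code, and the actual prompt texts are not reproduced here.

```python
from itertools import product

# The three psychophysical testing procedures named in the abstract.
PSYCHOPHYSICAL_PROCEDURES = ("single-stimulus", "double-stimulus", "multiple-stimulus")

# The three NLP prompting strategies named in the abstract.
PROMPTING_STRATEGIES = ("standard", "in-context", "chain-of-thought")


def enumerate_prompting_systems():
    """Return every (procedure, strategy) pairing as a list of dicts.

    Hypothetical helper: it only lists the combinations and does not
    build the prompts themselves.
    """
    return [
        {"procedure": procedure, "strategy": strategy}
        for procedure, strategy in product(PSYCHOPHYSICAL_PROCEDURES, PROMPTING_STRATEGIES)
    ]


systems = enumerate_prompting_systems()
for index, system in enumerate(systems, start=1):
    print(f"{index}: {system['procedure']} + {system['strategy']}")
```

Each of the nine entries corresponds to one prompting system that would then be evaluated per model; how a given pairing is turned into an actual text prompt is left open here.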

Paper Structure (20 sections, 4 equations, 6 figures, 3 tables)

Introduction
Related Work
Expert Models for IQA
MLLMs for IQA
Prompting MLLMs for IQA
Prompting Strategies from Psychophysics
Prompting Strategies from NLP
Computational Procedure for Difficult Sample Selection
Experiments
Experimental Setups
Comparison of Nine Prompting Systems
Further Testing on Difficult Data
Discussion and Limitation
Conclusion
More Experimental Setups
...and 5 more sections

Figures (6)

Figure 1: Illustration of visual attributes of image quality in our experiments.
Figure 2: Three standardized psychophysical testing procedures for IQA. (a) Single-stimulus method. (b) Double-stimulus method. (c) Multiple-stimulus method.
Figure 3: Instantiations of systematic prompting strategies for GPT-4V in the NR scenario. (a) Standard prompting. (b) Chain-of-thought prompting. (c) In-context prompting. See complete FR and NR text prompts in the supplementary material.
Figure 4: Comparison between difficult sample selection with and without variance normalization under the same level of sample diversity.
Figure 5: Behaviors of different MLLMs in recognizing objects from multiple images.
...and 1 more figure
