2AFC Prompting of Large Multimodal Models for Image Quality Assessment

Hanwei Zhu; Xiangjie Sui; Baoliang Chen; Xuelin Liu; Peilin Chen; Yuming Fang; Shiqi Wang

TL;DR

The paper tackles image quality assessment (IQA) with large multimodal models (LMMs) by framing IQA as a two-alternative forced choice (2AFC) prompting task and using maximum a posteriori (MAP) estimation to convert pairwise preferences into a global ranking. It introduces coarse-to-fine pairing rules and three evaluation metrics (consistency, accuracy, and correlation) to systematically quantify IQA ability across diverse datasets. Experiments across eight IQA datasets show that GPT-4V most closely matches human judgments at a coarse level, while open LMMs exhibit biases and struggle with fine-grained discrimination, indicating substantial room for improvement. The work provides a practical benchmark and methodology to guide future development of LMM-based IQA systems and highlights the value of realistic distortions in training data.

Abstract

While abundant research has been conducted on improving high-level visual understanding and reasoning capabilities of large multimodal models (LMMs), their visual quality assessment (IQA) ability has been relatively under-explored. Here we take initial steps towards this goal by employing two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as the most reliable way of collecting human opinions of visual quality. Subsequently, the global quality score of each image estimated by a particular LMM can be efficiently aggregated using maximum a posteriori (MAP) estimation. Meanwhile, we introduce three evaluation criteria: consistency, accuracy, and correlation, to provide comprehensive quantifications and deeper insights into the IQA capability of five LMMs. Extensive experiments show that existing LMMs exhibit remarkable IQA ability on coarse-grained quality comparison, but there is room for improvement on fine-grained quality discrimination. The proposed dataset sheds light on the future development of IQA models based on LMMs. The code will be made publicly available at https://github.com/h4nwei/2AFC-LMMs.
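
The aggregation step described above can be illustrated with a minimal sketch. It assumes a Thurstone Case V likelihood (the probability that image i beats image j is the standard normal CDF of their score difference) and a zero-mean Gaussian prior on the scores, optimized by plain gradient ascent. These modeling choices, the function name `map_scores`, and all parameter values are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Standard normal probability density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def map_scores(wins, n_items, prior_var=1.0, lr=0.05, n_steps=2000):
    """Estimate one global score per item from pairwise preference counts.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Assumed model (illustrative): P(i beats j) = Phi(q_i - q_j), q ~ N(0, prior_var).
    """
    q = [0.0] * n_items
    for _ in range(n_steps):
        # Gradient of the Gaussian log-prior pulls every score towards zero.
        grad = [-qi / prior_var for qi in q]
        for (i, j), count in wins.items():
            d = q[i] - q[j]
            p = min(max(normal_cdf(d), 1e-12), 1.0)  # guard against division by zero
            g = count * normal_pdf(d) / p            # d/dq_i of count * log Phi(d)
            grad[i] += g
            grad[j] -= g
        q = [qi + lr * gi for qi, gi in zip(q, grad)]
    return q


# Toy example: image 0 beats images 1 and 2, and image 1 beats image 2.
wins = {(0, 1): 3, (1, 2): 3, (0, 2): 3}
scores = map_scores(wins, n_items=3)
ranking = sorted(range(3), key=lambda k: scores[k], reverse=True)
```

The prior keeps the scores finite even when one image wins every comparison, which a plain maximum-likelihood fit would not guarantee.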

Paper Structure (11 sections, 5 equations, 3 figures, 5 tables)

This paper contains 11 sections, 5 equations, 3 figures, 5 tables.

Introduction
Ingredients of Probing Pipeline
Coarse-to-fine Pairing
Maximum a Posteriori Estimation
Evaluation Criteria
Experiments
Experimental Setups
Coarse-grained IQA Performance
Fine-grained IQA Performance
Ablation Experiments
Conclusions

Figures (3)

Figure 1: Probing the IQA capability of LMMs via two-alternative forced choice. (a) A pair of images with the corresponding normalized mean opinion scores (MOSs), which are in the range of $[0,100]$. A larger value indicates better visual quality. (b) An order-reversed version of (a). Humans can effortlessly select the "Train" image with better visual quality regardless of presentation order, but it is unclear whether the LMMs can make the same right choice. In this example, IDEFICS-Instruct [IDEFICS] gives the incorrect prediction. mPLUG-Owl [ye2023mplug], XComposer-VL [zhang2023internlm], and Q-Instruct [wu2023q] are indifferent to presentation order, and biased towards selecting the second and the first image, respectively. The proprietary GPT-4V [gpt4v] is well aligned with human perception of visual quality.
Figure 2: Illustration of three pairing rules for fine-grained quality comparison. (a)&(b) Two synthetically distorted images with identical visual content and distortion type but different distortion levels. (c)&(d) Two synthetically distorted images with identical visual content and distortion level but different distortion types. (e)&(f) Two realistically distorted images in the MOS interval of $[0, 25)$.
Figure 3: Validation of MAP estimation in aggregating pairwise rankings from the human observer, NIQE, and DBCNN, respectively. MAP estimation quickly converges as the number of pairing rounds increases, where each round consists of $N$ paired comparisons for $N$ test images. Results are reported on images sampled from SPAQ ($N = 160$).
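
The order-reversal check of Figure 1 and the pairing rounds mentioned in Figure 3 can be sketched as follows. This is an illustrative sketch only: `choose` is a hypothetical callable standing in for an LMM query (it returns 0 if the first-shown image is preferred and 1 otherwise), the random pairing in `make_round` is an assumed scheme, and the consistency and accuracy definitions used here are assumptions that may differ from the paper's.

```python
import random


def make_round(n_images, rng):
    """One pairing round: each image is compared once against a random other image."""
    pairs = []
    for i in range(n_images):
        j = rng.randrange(n_images - 1)
        if j >= i:
            j += 1  # shift past i so an image is never paired with itself
        pairs.append((i, j))
    return pairs


def consistency_and_accuracy(pairs, choose, mos):
    """Query every pair in both presentation orders.

    Consistency (assumed definition): fraction of pairs where the same image
    wins in both orders. Accuracy (assumed definition): fraction of all
    queries whose winner is the image with the higher MOS.
    """
    consistent = 0
    correct = 0
    for a, b in pairs:
        winner_ab = a if choose(a, b) == 0 else b  # shown as (a, b)
        winner_ba = b if choose(b, a) == 0 else a  # shown as (b, a)
        better = a if mos[a] >= mos[b] else b      # ties resolved towards a
        consistent += winner_ab == winner_ba
        correct += (winner_ab == better) + (winner_ba == better)
    n = len(pairs)
    return consistent / n, correct / (2 * n)


# Toy usage with made-up MOS values and two simulated "models".
mos = [10.0, 50.0, 90.0]
pairs = make_round(len(mos), random.Random(0))
always_first = lambda a, b: 0                        # position-biased chooser
oracle = lambda a, b: 0 if mos[a] > mos[b] else 1    # follows the MOS ranking
biased_result = consistency_and_accuracy(pairs, always_first, mos)
oracle_result = consistency_and_accuracy(pairs, oracle, mos)
```

A chooser that always picks the first-shown image is never consistent under these definitions, and its accuracy sits at chance level, which mirrors the position bias described in Figure 1.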
