Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, Shiqi Wang

TL;DR

The paper addresses the challenge of converting relative quality judgments into continuous scores for no-reference image quality assessment (NR-IQA) with large multimodal models (LMMs). It introduces Compare2Score, which learns qualitative pairwise comparisons derived from mean opinion scores (MOSs) within the same dataset, acts as a multi-image visual quality comparator, and applies an adaptive soft comparison against anchor images to produce continuous quality scores via maximum a posteriori (MAP) estimation under Thurstone's Case V model. The key contributions are within-dataset generation of comparative instructions, an anchor-based soft-inference framework, and extensive evaluation showing state-of-the-art NR-IQA performance and strong cross-dataset generalization, including zero-shot improvements for general-purpose LMMs. Because comparisons are drawn only within each dataset, the approach scales to combining multiple IQA datasets and supports image quality evaluation across synthetic, realistic, and AI-generated distortions.
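The within-dataset instruction generation described above can be illustrated with a short Python sketch. It models each image's quality as a Gaussian with its MOS as mean and its rating standard deviation as spread (a Thurstone-style assumption), converts the resulting preference probability into one of five comparative levels, and wraps the result in a question-answer pair. The level names, the cut points in `comparative_level`, the prompt wording and all function names are hypothetical choices for illustration, not the paper's exact recipe.

```python
import math

# Assumed names of the five comparative levels, ordered from "first image much
# worse than the second" to "first image much better than the second".
LEVELS = ["inferior", "worse", "similar", "better", "superior"]


def preference_prob(mos_a, std_a, mos_b, std_b):
    """Probability that image A is perceived as better than image B, modelling each
    image's quality as a Gaussian with its MOS as mean and its rating std as spread."""
    z = (mos_a - mos_b) / math.hypot(std_a, std_b)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z


def comparative_level(p, thresholds=(0.1, 0.4, 0.6, 0.9)):
    """Bucket p(A > B) into one of the five levels (hypothetical cut points)."""
    for level, upper in zip(LEVELS, thresholds):
        if p < upper:
            return level
    return LEVELS[-1]


def make_instruction(img_a, img_b, mos_a, std_a, mos_b, std_b):
    """Build one instruction-response pair for two images from the SAME dataset."""
    level = comparative_level(preference_prob(mos_a, std_a, mos_b, std_b))
    return {
        "images": [img_a, img_b],
        "question": "Compared with the second image, how is the quality of the first image?",
        "answer": f"Compared with the second image, the quality of the first image is {level}.",
    }
```

Because only images from the same dataset are ever paired, MOS values never need to be rescaled across datasets, which is what allows differently annotated datasets to be pooled (Figure 1c).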

Abstract

While recent advances in large multimodal models (LMMs) have significantly improved their image quality assessment (IQA) abilities based on absolute quality ratings, how to transfer reliable relative quality comparisons into continuous perceptual quality scores remains largely unexplored. To address this gap, we introduce Compare2Score, an all-around LMM-based no-reference IQA (NR-IQA) model that produces qualitatively comparative responses and effectively translates these discrete comparative levels into a continuous quality score. Specifically, during training, we propose generating scaled-up comparative instructions by comparing images from the same IQA dataset, which allows diverse IQA datasets to be combined more flexibly. Using the resulting large-scale training corpus, we develop a human-like visual quality comparator. During inference, moving beyond binary choices, we propose a soft comparison method that calculates the likelihood of the test image being preferred over multiple predefined anchor images. The quality score is then optimized by maximum a posteriori (MAP) estimation over the resulting probability matrix. Extensive experiments on nine IQA datasets validate that Compare2Score effectively bridges the text-defined comparative levels used in training with the single-image quality score produced at inference, surpassing state-of-the-art IQA models across diverse scenarios. Moreover, we verify that the probability-matrix-based inference conversion improves the rating accuracy not only of Compare2Score but also of zero-shot general-purpose LMMs, suggesting its intrinsic effectiveness.
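The soft comparison and MAP conversion described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapping of the five comparative levels to the values 0, 0.25, 0.5, 0.75 and 1, the hypothetical helpers `soft_preference`, `build_matrix` and `map_scores`, the use of anchor MOS and standard deviations to fill anchor-versus-anchor entries, and the standard normal prior in the MAP objective are all assumptions. The anchor MOS and σ values are the five KonIQ-10k anchors listed in Figure 5; the test image's preferences are made-up numbers standing in for comparator outputs.

```python
import math

# Assumed mapping from the five comparative levels to a preference value.
LEVEL_VALUES = [0.0, 0.25, 0.5, 0.75, 1.0]  # inferior, worse, similar, better, superior


def soft_preference(level_logits):
    """Softmax over the five level-token logits, then the expected preference value."""
    m = max(level_logits)
    exps = [math.exp(x - m) for x in level_logits]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, LEVEL_VALUES))


def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def build_matrix(anchor_mos, anchor_std, test_prefs):
    """(N+1)x(N+1) matrix where P[i][j] = probability that item i beats item j.
    Anchor-vs-anchor entries come from MOS and std (assumed Gaussian);
    the last row and column hold the test image's soft preferences."""
    n = len(anchor_mos)
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                z = (anchor_mos[i] - anchor_mos[j]) / math.hypot(anchor_std[i], anchor_std[j])
                P[i][j] = normal_cdf(z)
    for k, p in enumerate(test_prefs):
        P[n][k] = p        # test image preferred over anchor k
        P[k][n] = 1.0 - p  # anchor k preferred over test image
    return P


def map_scores(P, lr=0.05, iters=1000):
    """Gradient ascent on sum_ij P[i][j] * log Phi(q_i - q_j) - 0.5 * sum_i q_i**2,
    i.e. a Thurstone Case V log-likelihood plus a standard normal prior."""
    n = len(P)
    q = [0.0] * n
    for _ in range(iters):
        grad = [-qi for qi in q]  # gradient of the log-prior
        for i in range(n):
            for j in range(n):
                if i == j or P[i][j] == 0.0:
                    continue
                d = q[i] - q[j]
                r = normal_pdf(d) / max(normal_cdf(d), 1e-300)  # derivative of log Phi(d)
                grad[i] += P[i][j] * r
                grad[j] -= P[i][j] * r
        q = [qi + lr * g for qi, g in zip(q, grad)]
    return q


# Five KonIQ-10k anchors from Figure 5.
anchor_mos = [1.09, 2.02, 2.96, 3.21, 4.01]
anchor_std = [0.29, 0.39, 0.38, 0.41, 0.34]
# Hypothetical P(test preferred over anchor k); in practice each entry would be
# soft_preference(...) applied to the comparator's level logits for that pair.
test_prefs = [0.98, 0.8, 0.4, 0.2, 0.03]
scores = map_scores(build_matrix(anchor_mos, anchor_std, test_prefs))
test_score = scores[-1]  # latent quality score of the test image
```

Only the test image's row and column depend on the comparator, so in this sketch each test image costs one comparator call per anchor; the MAP step itself is cheap. The returned score lives on a latent z-scale, and any mapping back to the dataset's MOS range is not shown.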

Paper Structure (36 sections, 6 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Illustrations of the motivation of this work. (a) Images with identical rescaled MOS from various IQA datasets exhibit significant variations in perceptual quality. (b) Images that cluster at the same rating level in different IQA datasets display mismatches due to differing subjective testing methodologies. (c) Comparing MOSs within the same dataset facilitates the flexible combination of multiple IQA datasets.
  • Figure 2: Training and inference phases of Compare2Score. (a) The LMM is fine-tuned with instruction-response pairs generated by comparing MOSs within the same IQA dataset, allowing for a more flexible combination of various IQA datasets. (b) The trained visual quality comparator (i.e., the LMM) computes the likelihood of a test image being preferred over the anchor images, and the quality score is then derived by MAP estimation.
  • Figure 3: Architecture of the proposed Compare2Score. Images are first processed by an image encoder, followed by token reduction in an abstractor module. The aligned textual and visual embeddings are interleaved and processed by the large language model (LLM) decoder to generate precise qualitative comparative levels for paired comparisons.
  • Figure 4: Comparison of SRCC results and running time with different numbers of anchor images per quality interval ($\beta$).
  • Figure 5: Illustration of the five anchor images selected from KonIQ-10k hosu2020koniq. (a) MOS = $1.09$, $\sigma = 0.29$; (b) MOS = $2.02$, $\sigma = 0.39$; (c) MOS = $2.96$, $\sigma = 0.38$; (d) MOS = $3.21$, $\sigma = 0.41$; (e) MOS = $4.01$, $\sigma = 0.34$.
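Figures 4 and 5 refer to $\beta$ anchor images per quality interval. One hypothetical way to pick them is sketched below: split the observed MOS range into α equal-width intervals and, within each, keep the β images with the smallest rating standard deviation σ, on the assumption that a low σ marks a reliable anchor. The function name `select_anchors`, the equal-width split and the lowest-σ criterion are assumptions, not the paper's stated procedure.

```python
def select_anchors(items, alpha=5, beta=1):
    """Pick up to `beta` anchors in each of `alpha` equal-width MOS intervals.

    items: list of (image_id, mos, std) tuples from one IQA dataset.
    Returns the selected tuples sorted by MOS.
    """
    lo = min(mos for _, mos, _ in items)
    hi = max(mos for _, mos, _ in items)
    if hi == lo:
        raise ValueError("MOS values must span a non-empty range")
    width = (hi - lo) / alpha
    bins = [[] for _ in range(alpha)]
    for item in items:
        # Clamp so the highest MOS falls into the last interval, not past it.
        idx = min(int((item[1] - lo) / width), alpha - 1)
        bins[idx].append(item)
    anchors = []
    for members in bins:
        members.sort(key=lambda it: it[2])  # most consistent ratings (lowest std) first
        anchors.extend(members[:beta])
    return sorted(anchors, key=lambda it: it[1])


# Hypothetical toy dataset: (image_id, MOS, std).
dataset = [("a", 1.0, 0.5), ("b", 1.1, 0.2), ("c", 2.0, 0.3), ("d", 2.2, 0.1),
           ("e", 3.0, 0.4), ("f", 4.0, 0.3), ("g", 5.0, 0.2), ("h", 4.9, 0.6)]
anchors = select_anchors(dataset, alpha=5, beta=1)
```

In this sketch every test image is compared against every anchor, so the number of comparator calls grows linearly with α·β, in line with the SRCC-versus-running-time comparison in Figure 4.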