Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Zhiyuan You; Zheyuan Li; Jinjin Gu; Zhenfei Yin; Tianfan Xue; Chao Dong

TL;DR

DepictQA reframes image quality assessment as a language-based, multi-modal task rather than the prediction of a single scalar score. Using a frozen CLIP image encoder, a trainable image projector, and LoRA-tuned LLMs, it produces descriptive, human-like evaluations and justifications across three tasks: distortion description, pairwise quality comparison, and reasoning that weighs the contributing factors. The authors introduce M-BAPPS, a large, richly described multi-modal IQA dataset derived from BAPPS, and show that multi-source training with specialized image tags enables robust, interpretable IQA that outperforms score-based methods on several benchmarks and, after fine-tuning, also outperforms general MLLMs. They also explore non-reference extensions and provide extensive ablations, highlighting the method’s potential and its current limitations in data scale, task coverage, and efficiency.
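To make the "frozen weights plus LoRA" idea concrete, the following is a minimal sketch of a generic LoRA-style linear layer in plain Python. It is not the DepictQA implementation: the class name `LoRALinear`, the helper `matmul`, and the `rank`, `alpha`, and `seed` values are assumptions chosen for illustration. It only shows the general mechanism, in which a pre-trained weight stays fixed and a small low-rank update is the trainable part.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / rank) * B @ A.

    Illustrative only: names and default values are hypothetical.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen: would not be updated during fine-tuning
        # A starts with small random values and B starts at zero,
        # so the low-rank update contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A) as a list of rows."""
        delta = matmul(self.B, self.A)  # shape: out_dim x in_dim
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the layer to a vector x given as a list of floats."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameter_count(self):
        """Only A and B would be trained; W is excluded."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Because `B` is initialized to zero, the layer initially reproduces the frozen model exactly, and fine-tuning only has to learn the entries of `A` and `B`.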

Abstract

We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods. DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large Language Models (MLLMs). Unlike conventional Image Quality Assessment (IQA) methods relying on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with humans' reasoning process. To build the DepictQA model, we establish a hierarchical task framework and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. These designs allow DepictQA to outperform score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA can generate more accurate descriptive reasoning. We also demonstrate that our full-reference dataset can be extended to non-reference applications. These results showcase the research potential of multi-modal IQA methods. Code and datasets are available at https://depictqa.github.io.
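The idea of specialized image tags for multi-image input can be illustrated with a small sketch. The tag names (`Img-Reference`, `Img-A`, `Img-B`), the `<visual>` placeholder, and the function `build_comparison_prompt` are all hypothetical and are not taken from the paper or its code. The sketch only shows how wrapping each image's visual tokens in a distinct tag might let a language model tell a reference image apart from the two candidates.

```python
def build_comparison_prompt(question, num_visual_tokens=2):
    """Assemble a text prompt in which each image has its own unique tag.

    Hypothetical sketch: in a real model the "<visual>" placeholders would
    be replaced by projected visual tokens from the image encoder.
    """
    tags = ["Img-Reference", "Img-A", "Img-B"]  # assumed tag names
    parts = []
    for tag in tags:
        placeholder = " ".join(["<visual>"] * num_visual_tokens)
        parts.append(f"<{tag}> {placeholder} </{tag}>")
    parts.append(question)
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_comparison_prompt("Which image has better quality, Image A or Image B?"))
```

Giving every image a different tag, rather than one shared image tag, is one plausible way to keep a question that mentions "Image A" tied to the right visual tokens.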

Paper Structure (30 sections, 24 figures, 11 tables)

Introduction
Related Works
DepictQA Task and Dataset
Task Description
Dataset Construction
DepictQA Framework
Model Architecture
Training Scheme
Experiments
Metrics
Comparison with Score-based IQA Methods
Comparison with General Multi-modal LLMs
Ablation Studies
Extensions
Conclusions and Limitations
...and 15 more sections

Figures (24)

Figure 1: Comparison between our DepictQA and score-based IQA methods, including PSNR, SSIM, LPIPS, and PieAPP. Score-based IQA methods only provide numerical scores devoid of reasoning and justification. Thus they disagree with human judgments in complex scenarios when (a) images are misaligned and (b) both images suffer from severe distortions. In contrast, DepictQA first identifies the distortions of images, then weighs the influence of different distortions on texture damage, and finally obtains comparison results that are better aligned with human judgments.
Figure 2: Collection of the responses in our M-BAPPS dataset. We first carefully design a questionnaire to collect quality-related information. We then employ GPT-4 to convert our annotated questionnaire results into natural language. Finally, the outputs of GPT-4 are modified and improved by the annotators to correct errors, eliminate ambiguities, and supplement important information.
Figure 3: Framework of DepictQA. A frozen pre-trained image encoder is employed to encode images to visual tokens, followed by a trainable image projector to project visual tokens to textual space. The question texts are tokenized by a text tokenizer. Visual tokens and textual tokens are then fused and jointly processed by an LLM, fine-tuned through the LoRA technique. Our model is capable of producing comprehensive and informative explanations for image quality comparisons.
Figure 4: Unique tag alleviates the confusion problem using clearer instructions. The confusion rate drops dramatically.
Figure 5: Comparison performance gradually improves as the size of the training data increases.
...and 19 more figures
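
As a small illustration of the point made in the Figure 1 caption, the sketch below computes PSNR, one of the score-based metrics named there, in plain Python. PSNR is defined as 10 · log10(MAX² / MSE). The pixel lists are made-up toy values, not data from the paper; they only suggest how a pixel-wise score can penalize a misaligned image far more than a slightly noisy one while giving no explanation for the number it returns.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio between two equally sized pixel lists."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel positions.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [10, 200, 10, 200]  # toy striped pattern (assumed values)
    shifted = [200, 10, 200, 10]    # same pattern, shifted by one pixel
    noisy = [12, 198, 12, 198]      # aligned, with mild noise
    print("shifted:", psnr(reference, shifted))
    print("noisy:  ", psnr(reference, noisy))
```

The output is a single number per image pair, which is exactly the limitation the caption describes: the score carries no account of which distortion was found or why one image was preferred.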
