Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset

Zhiyuan You; Jinjin Gu; Xin Cai; Zheyuan Li; Kaiwen Zhu; Chao Dong; Tianfan Xue

TL;DR

The paper addresses the task fragmentation and limited real-world applicability of existing IQA methods by introducing DepictQA-Wild, a multi-modal, descriptive IQA model trained on the DQ-495K dataset. It defines a unified task paradigm covering single-image assessment and paired-image comparison, in both full-reference and non-reference settings, with brief and detailed responses. Key contributions include a 12+35 distortion library, GT-informed data generation, resolution-preserving training, and a confidence-aware answering mechanism, validated through benchmarks, ablations, and real-world applications. The reported results show that DepictQA-Wild outperforms score-based and prior VLM-based IQA methods, indicating practical potential for real-world image quality analysis and retrieval tasks.

Abstract

With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce the enhanced Depicted image Quality Assessment model (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, and full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that helps filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications, including assessing web-downloaded images and ranking model-processed images. Code, datasets, and model weights have been released at https://depictqa.github.io/.

Paper Structure (21 sections, 20 figures, 23 tables)

This paper contains 21 sections, 20 figures, 23 tables.

Introduction
Related Works
Task Paradigm and Dataset Construction
Task Paradigm
Distortion Library
Dataset Construction
Dataset Analysis
Model Design
Experiments
Metrics and Baselines
Results on Benchmarks
Quality score regression
Ablation Studies
Real-world Applications
Complexity and Efficiency
...and 6 more sections

Figures (20)

Figure 1: Performance comparison. Our model surpasses previous works including Q-Instruct, Co-Instruct, and the proprietary GPT-4V across a broad range of tasks in both full-reference and non-reference settings. Traditional score-based IQA methods like LPIPS and MUSIQ have no language abilities, and thus can only be used in the instant rating task. Q-Instruct is only tested on single-image input tasks.
Figure 2: Illustration of our task paradigm and qualitative results. Our DepictQA-Wild focuses on two main tasks, single-image assessment and paired-image comparison, in both full-reference and non-reference settings. Each task contains a brief sub-task focusing on the fundamental IQA ability, and a detailed sub-task fostering the reasoning capacities. More qualitative results are provided in the Supp. Mat.
Figure 3: Construction of DQ-495K dataset. For distortion identification, templated responses are generated using distortion information. In instant rating, we sample images from existing datasets and compare the Mean Opinion Score (MOS) to determine the better image for templated response creation. For assessment reasoning and comparison reasoning tasks, we provide GPT-4V with evaluated images and Ground Truth (GT) details (i.e., distortion information, comparison results from an assistant model) to facilitate detailed and accurate response generation, called GT-informed generation. This additional information is critical as GPT-4V cannot produce it accurately.
Figure 4: Statistics of all images in our dataset about (a) semantic features, (b) brightness, (c) contrast, (d) colorfulness, (e) edge density, and (f) texture variance.
Figure 5: Word length distribution of detailed responses.
...and 15 more figures
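
The instant rating step described in the Figure 3 caption (compare the MOS of two sampled images, then fill a template with the better one) can be illustrated with a minimal sketch. This is not the authors' code: the `RatedImage` type, the template wording, the `min_gap` parameter, and the assumption that a higher MOS means better quality are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class RatedImage:
    """Hypothetical container for an image drawn from an existing IQA dataset."""
    name: str
    mos: float  # Mean Opinion Score; assumed here that higher means better


# Hypothetical templates; the actual wording used for DQ-495K is not given here.
TEMPLATES = [
    "Image {better} has better quality than Image {worse}.",
    "Image {better} is of higher quality than Image {worse}.",
]


def instant_rating_sample(
    img_a: RatedImage,
    img_b: RatedImage,
    min_gap: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Optional[dict]:
    """Build one templated comparison sample from two MOS-labelled images.

    Returns None when the MOS gap is not larger than `min_gap`, since the
    better image is then ambiguous (an assumption of this sketch).
    """
    gap = img_a.mos - img_b.mos
    if abs(gap) <= min_gap:
        return None
    better, worse = ("A", "B") if gap > 0 else ("B", "A")
    rng = rng or random.Random(0)
    response = rng.choice(TEMPLATES).format(better=better, worse=worse)
    return {"better": better, "response": response}


if __name__ == "__main__":
    sample = instant_rating_sample(RatedImage("a.png", 4.2), RatedImage("b.png", 3.1))
    print(sample)
```

Skipping pairs with a small MOS gap is one plausible way to avoid noisy labels; whether DQ-495K applies such a threshold is not stated in the caption.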
