Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset
Zhiyuan You, Jinjin Gu, Xin Cai, Zheyuan Li, Kaiwen Zhu, Chao Dong, Tianfan Xue
TL;DR
The paper addresses fragmentation and limited real-world applicability in Image Quality Assessment (IQA) by introducing DepictQA-Wild, a multi-modal, descriptive IQA model trained on the large-scale DQ-495K dataset. It defines a unified task paradigm that covers single-image assessment and paired-image comparison, full-reference and non-reference settings, and brief and detailed responses. Key contributions include a distortion library with 35 distortion types in 12 categories, ground-truth-informed data generation, resolution-preserving training, and a confidence score for filtering out low-quality responses. The reported results show DepictQA-Wild outperforming score-based methods, prior VLM-based IQA models, and GPT-4V in distortion identification, instant rating, and reasoning tasks. Real-world applications, namely assessing web-downloaded images and ranking model-processed images, further support its practical potential.
Abstract
With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce the enhanced Depicted image Quality Assessment model (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, and full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that helps filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications, including assessing web-downloaded images and ranking model-processed images. Code, datasets, and model weights have been released at https://depictqa.github.io/.
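The abstract does not say how the confidence score is computed, so the following is only a minimal sketch of how such a score could be used to filter responses. It assumes (this is an assumption, not the authors' stated method) that the confidence is the geometric mean of the probabilities the model assigned to a handful of key tokens in its response. The names `response_confidence`, `filter_by_confidence`, and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def response_confidence(key_token_logprobs: Sequence[float]) -> float:
    """Geometric mean of key-token probabilities (assumed scoring rule).

    `key_token_logprobs` holds natural-log probabilities of the tokens
    judged to carry the quality verdict (e.g. a distortion name).
    """
    if not key_token_logprobs:
        raise ValueError("need at least one key-token log-probability")
    mean_logprob = sum(key_token_logprobs) / len(key_token_logprobs)
    return math.exp(mean_logprob)


def filter_by_confidence(
    responses: Sequence[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep only responses whose confidence reaches the threshold."""
    return [
        text
        for text, logprobs in responses
        if response_confidence(logprobs) >= threshold
    ]


# Hypothetical responses paired with their key-token log-probabilities.
candidates = [
    ("The image suffers from strong motion blur.", [math.log(0.9), math.log(0.8)]),
    ("The image has slight color shift.", [math.log(0.3), math.log(0.2)]),
]
kept = filter_by_confidence(candidates, threshold=0.5)
```

With these made-up numbers, only the first response is kept. In practice the threshold would need to be tuned on held-out data, since a higher value trades coverage for reliability.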
