Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment
Fei Zhou, Tianhao Gu, Zhicong Huang, Guoping Qiu
TL;DR
This work tackles blind image quality assessment (BIQA) by recognizing that semantic content, distortion characteristics, and appearance jointly shape perceived quality. It introduces SLIQUE, a two-branch framework that combines vision-language contrastive learning with visual self-supervised learning to learn high-level quality representations, and pairs it with a ridge-regression head for MOS prediction. The authors build the Text Annotated Distortion, Appearance and Content (TADAC) database, a resource of over 1.6 million images annotated with text describing their content, distortion, and appearance, which supplies diverse, quality-relevant supervision. Results on multiple synthetic, authentic, underwater, and enhanced IQA datasets show SLIQUE achieving state-of-the-art performance and strong cross-dataset generalization, supporting the use of joint vision-language and self-supervised representations for IQA.
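As an illustration of the ridge-regression head mentioned above, the following is a minimal sketch, assuming the learned representation is frozen and each image is reduced to a feature vector before regression onto MOS. The function names (`solve_linear`, `fit_ridge`, `predict`), the toy features, and the regularisation strength `lam` are hypothetical and are not taken from the paper; the closed form used is the standard ridge solution with an unpenalised bias.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(features, mos, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y.

    A constant column is appended for the bias, and the bias is left
    unpenalised (assumption: the usual convention, not the paper's spec).
    """
    X = [list(f) + [1.0] for f in features]
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) for j in range(d)]
            for i in range(d)]
    for i in range(d - 1):  # skip the last diagonal entry (bias)
        gram[i][i] += lam
    rhs = [sum(row[i] * y for row, y in zip(X, mos)) for i in range(d)]
    return solve_linear(gram, rhs)


def predict(weights, feature):
    """Predicted MOS: dot product with the feature weights plus the bias."""
    return sum(w * x for w, x in zip(weights, feature)) + weights[-1]


# Toy data (hypothetical): 2-D "features" and made-up MOS values.
toy_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_mos = [1.0, 3.0, 4.0, 6.0]
weights = fit_ridge(toy_features, toy_mos, lam=1.0)
score = predict(weights, [0.5, 0.5])
```

Keeping the regressor this simple means prediction quality depends almost entirely on the learned representation, which is the usual motivation for evaluating a frozen encoder with a linear head.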
Abstract
The visual quality of an image is confounded by a number of intertwined factors, including its semantic content, distortion characteristics, and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high-level knowledge about all these quality-bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality-related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high-level knowledge about an image's semantic content, distortion characteristics, and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first-of-its-kind large image database annotated with all three categories of quality-relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics, and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE outperforms state-of-the-art methods, demonstrating the soundness of its design principle and the effectiveness of its implementation.
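To make the vision-language contrastive idea concrete, here is a minimal sketch of a symmetric image-text contrastive loss in the style commonly used for such models. The abstract does not specify SLIQUE's exact objective, so the loss form, the `temperature` value, and all function names below are assumptions chosen for illustration only. Each image embedding is paired with the embedding of its own text annotation, and the loss is low when matched pairs are more similar than mismatched ones.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy losses.

    image_emb[i] and text_emb[i] are assumed to describe the same image.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy embeddings (hypothetical): two images and their two text annotations.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(images, texts_matched)
loss_swapped = symmetric_contrastive_loss(images, texts_swapped)
```

In this toy setup the matched pairing yields a much smaller loss than the swapped pairing, which is the behaviour that pushes an encoder to align images with texts describing their content, distortion, and appearance.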
