Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment

Fei Zhou; Tianhao Gu; Zhicong Huang; Guoping Qiu

TL;DR

This work addresses blind image quality assessment (BIQA) on the premise that semantic content, distortion characteristics, and appearance jointly shape perceived quality. It introduces SLIQUE, a two-branch framework that combines vision-language contrastive learning with visual self-supervised learning to learn high-level quality representations, and pairs it with a ridge-regression head for MOS prediction. The authors build the Text Annotated Distortion, Appearance and Content (TADAC) database, a resource of over 1.6 million images annotated with content, distortion, and appearance texts, which provides diverse, quality-relevant supervision. Results on synthetic, authentic, underwater, and enhanced IQA datasets show SLIQUE achieving state-of-the-art performance and strong cross-dataset generalization, supporting the use of joint vision-language and self-supervised representations for IQA.
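The ridge-regression head mentioned above can be illustrated with a minimal sketch. It assumes features have already been extracted by a frozen image encoder; here they are replaced by random synthetic vectors, and the names `fit_ridge`, `predict_quality`, and the `alpha` value are illustrative choices, not taken from the paper.

```python
import numpy as np

def fit_ridge(features, mos, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Solves w = (Xc^T Xc + alpha * I)^-1 Xc^T yc on mean-centered data.
    """
    x_mean = features.mean(axis=0)
    y_mean = mos.mean()
    xc = features - x_mean
    yc = mos - y_mean
    dim = features.shape[1]
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(dim), xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def predict_quality(features, w, b):
    """Map encoder features to a scalar quality score per image."""
    return features @ w + b

# Synthetic stand-ins for encoder features and MOS labels (assumption:
# real usage would substitute features from the trained image encoder).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
scores = feats @ true_w + 0.01 * rng.normal(size=200)

w, b = fit_ridge(feats[:150], scores[:150], alpha=1e-3)
pred = predict_quality(feats[150:], w, b)
```

Because only the linear head is fitted, the encoder stays fixed and the regression reduces to a single linear solve; the regularization strength would normally be chosen by validation.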

Abstract

The visual quality of an image is confounded by a number of intertwined factors, including its semantic content, distortion characteristics and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high-level knowledge about all these quality-bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality-related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high-level knowledge about the images' semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first-of-its-kind large image database annotated with all three categories of quality-relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE outperforms state-of-the-art methods, demonstrating the soundness of its design principle and the effectiveness of its implementation.

Paper Structure (18 sections, 6 equations, 6 figures, 14 tables)

Introduction
Related Work
Blind Image Quality Assessment
Vision-Language Contrastive Learning
Visual Self-supervised Learning
SLIQUE
Joint Contrastive Learning
Regression to Quality Score
Construction of database
Training Sample Pairing Strategy
Comparisons with Previous BIQA Works
Experiments
Implementation Details
Results and Comparisons
T-SNE Visualization of Quality Relevant Features
...and 3 more sections

Figures (6)

Figure 1: Example images containing synthetic (a, b) and real (c, d) distortions. Best viewed zoomed in.
Figure 2: Training the image encoder of SLIQUE. The Image-Language Branch performs vision-language contrastive learning to align text labels with image contents, and the Image-Image Branch carries out self-supervised visual learning. The objective is to train the image encoder to extract discriminative image features that capture all categories of quality-relevant image attributes. Note that the three image encoders in the diagram are identical. The saliency-based cropping module uses visual saliency to crop a high-resolution sub-image containing the maximum amount of visual information during training (rather than using the whole image, which can be too large). After the system is trained, only the image encoder is used to extract image features for predicting image quality; see the section "Regression to Quality Score".
Figure 3: The distribution of four quality-relevant aspects of authentic images in TADAC: (a) Brightness, (b) Contrast, (c) Sharpness and (d) Colorfulness.
Figure 4: 2D t-SNE visualizations of learned representations. For both CONTRIQUE and SLIQUE, we conducted 2D t-SNE visualization experiments using 7 different types of distorted images from 3 image databases and 3 different types of images from KADIS database.
Figure 5: 2D t-SNE visualizations of learned representations for 1,000 in-the-wild images from KonIQ.
...and 1 more figures
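The vision-language contrastive learning named in the Figure 2 caption can be illustrated with a generic sketch. The code below is a CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings. It is an assumption that SLIQUE's Image-Language Branch uses a loss of this general form; the paper's actual objective, its pairing strategy, and the Image-Image Branch are not reproduced here. The function names and the `temperature` value are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of image_emb is the positive for row i of text_emb."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Random embeddings as stand-ins for encoder outputs (assumption).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = symmetric_contrastive_loss(emb, emb)         # perfectly matched pairs
shuffled = symmetric_contrastive_loss(emb, emb[::-1])  # every pair mismatched
```

Matched pairs yield a lower loss than mismatched ones, which is the signal that pulls an image embedding toward the embedding of its own text annotation and away from the others in the batch.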
