Quality-Aware Image-Text Alignment for Opinion-Unaware Image Quality Assessment

Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini

TL;DR

QualiCLIP is a self-supervised, opinion-unaware NR-IQA method that fine-tunes CLIP's image encoder to produce quality-aware image representations by ranking progressively degraded image crops against quality-related antonym prompts. Training relies on a quality-aware image-text alignment strategy with three margin ranking losses applied to synthetic degradations, so it needs no mean opinion scores (MOS). The method achieves state-of-the-art performance among opinion-unaware methods and generalizes well across datasets. It is robust across authentic, restoration, and AIGC datasets and is competitive with supervised methods while keeping inference efficient. Overall, it offers a scalable baseline for NR-IQA that uses vision-language pretraining to emphasize low-level image quality cues.

Abstract

No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. Most state-of-the-art NR-IQA approaches are opinion-aware, i.e. they require human annotations for training. This dependency limits their scalability and broad applicability. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware approach that does not require human opinions. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate quality-aware image representations. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts. At the same time, we force CLIP to generate consistent representations for images with similar content and the same level of degradation. Our experiments show that the proposed method improves over existing opinion-unaware approaches across multiple datasets with diverse distortion types. Moreover, despite not requiring human annotations, QualiCLIP achieves excellent performance against supervised opinion-aware methods in cross-dataset experiments, thus demonstrating remarkable generalization capabilities. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP.
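
The degradation step described in the abstract can be illustrated with a minimal sketch. It assumes additive Gaussian noise as a stand-in degradation and a flat list of pixel values as the "image"; the noise levels in `SIGMAS` and the helper names `degrade` and `mse` are hypothetical and are not taken from the paper. The only point is that each level is a strictly harsher version of the same pristine input, which is what makes a ranking objective possible without human labels.

```python
import random


def degrade(pristine, noise, sigma):
    """Add the same noise pattern scaled by sigma, then clip to [0, 1]."""
    return [min(1.0, max(0.0, p + sigma * n)) for p, n in zip(pristine, noise)]


def mse(a, b):
    """Mean squared error between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pristine = [rng.uniform(0.2, 0.8) for _ in range(64)]  # toy "pristine image"
noise = [rng.gauss(0.0, 1.0) for _ in range(64)]       # one shared noise pattern

# Hypothetical intensities for L = 5 increasing degradation levels.
SIGMAS = [0.02, 0.05, 0.1, 0.2, 0.4]

levels = [degrade(pristine, noise, s) for s in SIGMAS]
errors = [mse(pristine, d) for d in levels]

for level, err in enumerate(errors, start=1):
    print(f"level {level}: mse = {err:.5f}")
```

Because every level reuses the same noise pattern, the distance from the pristine input can only grow with the level index, so the ordering of the degraded versions is known by construction.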

Paper Structure (21 sections, 5 equations, 14 figures, 11 tables)


Figures (14)

  • Figure 1: Comparison between the image quality scores predicted by CLIP-IQA [wang2023exploring] and the proposed QualiCLIP for increasing distortion intensities of different types of synthetic degradation. We average the results of 1000 randomly sampled images from the KonIQ-10k [hosu2020koniq10k] dataset. Our method exhibits a stronger inverse correlation between the predicted quality scores and the severity of the degradation. The distortion intensities are scaled between 0 and 1 for clearer visualization.
  • Figure 2: Examples of synthetic degradations for five increasing levels of intensity.
  • Figure 3: Overview of the proposed quality-aware image-text alignment strategy. Starting from a pair of two random overlapping crops from a pristine image, we synthetically degrade them with $L$ increasing levels of intensity, resulting in $L$ pairs. Then, given two quality-related antonym prompts $T_p$ and $T_n$, we fine-tune CLIP's image encoder with three margin ranking losses ($\mathcal{L}_{cons}$, $\mathcal{L}_{pos}$, $\mathcal{L}_{neg}$) by considering the similarity between the prompts and the degraded crops. Specifically, we use $\mathcal{L}_{cons}$ to force CLIP to generate consistent representations for the crops belonging to each pair, since they exhibit similar content and the same degree of distortion. At the same time, we make the similarity between the prompt $T_p$ (or $T_n$) and the increasingly degraded versions of the crops correlate inversely (or directly) with the intensity of the distortion through $\mathcal{L}_{pos}$ (or $\mathcal{L}_{neg}$).
  • Figure S1: gMAD competition results between QualiCLIP and GRepQ [srinath2024learning]. (a): Fixed QualiCLIP at a low- (top) and high-quality (bottom) level, respectively. (b): Fixed GRepQ at a low- (top) and high-quality (bottom) level, respectively.
  • Figure S2: gMAD [ma2016group] competition results between QualiCLIP and CLIP-IQA [wang2023exploring]. (a): Fixed QualiCLIP at a low- (top) and high-quality (bottom) level, respectively. (b): Fixed CLIP-IQA at a low- (top) and high-quality (bottom) level, respectively.
  • ...and 9 more figures
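
The three margin ranking losses named in the Figure 3 caption can be sketched in plain Python. This is one plausible formulation under stated assumptions, not necessarily the authors' exact equations: the hinge form, the all-pairs ranking, the margins `m_cons` and `m_rank`, applying the consistency term to both prompts, and the toy similarity values are all hypothetical. Each input list holds the similarity between one crop and one prompt at degradation levels 1 to $L$.

```python
def hinge(x):
    """Standard hinge: zero when the constraint is met, linear otherwise."""
    return max(0.0, x)


def ranking_loss(sims, margin):
    """Penalize each pair (i < j) where sims[j] is not below sims[i] by `margin`."""
    return sum(
        hinge(sims[j] - sims[i] + margin)
        for i in range(len(sims))
        for j in range(i + 1, len(sims))
    )


def consistency_loss(sims_a, sims_b, margin):
    """Penalize same-level crops whose prompt similarities differ by more than `margin`."""
    return sum(hinge(abs(a - b) - margin) for a, b in zip(sims_a, sims_b))


def total_loss(pos_a, pos_b, neg_a, neg_b, m_cons=0.05, m_rank=0.05):
    """Sum of the three terms; pos_* are similarities to T_p, neg_* to T_n."""
    l_cons = (consistency_loss(pos_a, pos_b, m_cons)
              + consistency_loss(neg_a, neg_b, m_cons))
    # Similarity to the positive prompt should fall as degradation grows.
    l_pos = ranking_loss(pos_a, m_rank) + ranking_loss(pos_b, m_rank)
    # Similarity to the negative prompt should rise, so rank the negated values.
    l_neg = (ranking_loss([-s for s in neg_a], m_rank)
             + ranking_loss([-s for s in neg_b], m_rank))
    return l_cons + l_pos + l_neg


# Toy similarities for L = 5 levels and two overlapping crops (a and b).
pos_a = [0.90, 0.70, 0.50, 0.30, 0.10]
pos_b = [0.88, 0.71, 0.50, 0.32, 0.10]
neg_a = [0.10, 0.30, 0.50, 0.70, 0.90]
neg_b = [0.12, 0.29, 0.50, 0.68, 0.90]

well_ordered = total_loss(pos_a, pos_b, neg_a, neg_b)
violated = total_loss(pos_a[::-1], pos_b[::-1], neg_a[::-1], neg_b[::-1])
print(well_ordered, violated)
```

When the similarities already follow the intended ordering, every hinge term is inactive and the total is zero; reversing the level order activates the ranking terms, which is the signal that would drive fine-tuning of the image encoder.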