Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment

Zhicheng Liao, Dongxu Wu, Zhenshan Shi, Sijie Mai, Hanwei Zhu, Lingyu Zhu, Yuncheng Jiang, Baoliang Chen

TL;DR

No-Reference Image Quality Assessment using CLIP often relies on semantic similarity $Q_{\text{sim}}$, which ignores embedding magnitude and can miss degradations. The authors introduce MA-CLIP, a magnitude-aware framework that computes $Q_{\text{mag}}$ from absolute CLIP features via Box-Cox normalization (parameter $\lambda$) and fuses it with $Q_{\text{sim}}$ through a confidence-guided mechanism using $\Delta=Q_{\text{sim}}-Q_{\text{mag}}$ and logits $\gamma_{\text{sim}}=1.0+\alpha\Delta$, $\gamma_{\text{mag}}=0.6-\alpha\Delta$ to produce the final score $Q$. The approach is training-free and achieves state-of-the-art zero-shot IQA performance across diverse datasets (synthetic, authentic, IR, and AIGC content), with backbone-agnostic gains. This work demonstrates that internal magnitude cues in pretrained vision-language models can be leveraged, with simple normalization and fusion, to deliver robust and generalizable perceptual quality assessment without task-specific supervision.
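The confidence-guided fusion summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the logit definitions follow the TL;DR, while the two-way softmax over the logits, the weighted-sum combination, the function names, and the default value of `alpha` are assumptions made for the example.

```python
import math


def fusion_weights(q_sim: float, q_mag: float, alpha: float = 1.0) -> tuple[float, float]:
    """Return (w_sim, w_mag) from the discrepancy delta = q_sim - q_mag.

    The logits 1.0 + alpha*delta and 0.6 - alpha*delta follow the summary;
    turning them into weights with a softmax is an assumption of this sketch.
    """
    delta = q_sim - q_mag
    logit_sim = 1.0 + alpha * delta
    logit_mag = 0.6 - alpha * delta
    # Numerically stable two-way softmax.
    m = max(logit_sim, logit_mag)
    e_sim = math.exp(logit_sim - m)
    e_mag = math.exp(logit_mag - m)
    w_sim = e_sim / (e_sim + e_mag)
    return w_sim, 1.0 - w_sim


def fuse_quality(q_sim: float, q_mag: float, alpha: float = 1.0) -> float:
    """Combine the two cues as a convex combination (assumed form of Q)."""
    w_sim, w_mag = fusion_weights(q_sim, q_mag, alpha)
    return w_sim * q_sim + w_mag * q_mag
```

Under these assumptions the fused score always lies between the two cues, and a larger `alpha` shifts more weight toward whichever cue reports the higher score.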

Abstract

Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
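The two cues described in the abstract can be illustrated with a short NumPy sketch. It is hedged throughout: the Box-Cox formula is the standard one, but averaging the transformed features into a scalar, the `eps` offset, the default `lam`, the softmax over the good/bad prompt pair, and the `temperature` value are assumptions for illustration. The feature vectors are taken as plain arrays, so no particular CLIP library API is implied, and the rescaling of the magnitude cue to a range comparable with the similarity cue is left out because it is not specified here.

```python
import numpy as np


def box_cox(x, lam: float):
    """Standard Box-Cox transform for strictly positive inputs."""
    x = np.asarray(x, dtype=np.float64)
    if abs(lam) < 1e-12:
        return np.log(x)
    return (np.power(x, lam) - 1.0) / lam


def magnitude_cue(image_feat, lam: float = 0.5, eps: float = 1e-6) -> float:
    """Scalar summary of absolute (un-normalized) image features.

    Mean aggregation and the eps offset (to keep inputs positive) are
    assumptions of this sketch.
    """
    abs_feat = np.abs(np.asarray(image_feat, dtype=np.float64)) + eps
    return float(np.mean(box_cox(abs_feat, lam)))


def similarity_cue(image_feat, good_text_feat, bad_text_feat,
                   temperature: float = 100.0) -> float:
    """Cosine-similarity cue against 'a good photo' / 'a bad photo' prompts.

    A softmax over the two prompts reduces to a sigmoid of the scaled
    similarity difference; the temperature is an assumed value.
    """
    def unit(v):
        v = np.asarray(v, dtype=np.float64)
        return v / np.linalg.norm(v)

    img = unit(image_feat)
    s_good = float(img @ unit(good_text_feat))
    s_bad = float(img @ unit(bad_text_feat))
    return float(1.0 / (1.0 + np.exp(-temperature * (s_good - s_bad))))
```

Because `similarity_cue` normalizes every vector to unit length, scaling the image feature leaves it unchanged, whereas `magnitude_cue` responds to that scaling. That difference is what lets the two cues carry complementary information.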

Paper Structure

This paper contains 25 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: (a) Limitations of prompt-based CLIP-IQA: although the images exhibit a wide range of perceptual quality (reflected in their MOS), the cosine similarity between the image embedding and the textual prompts remains nearly constant. In contrast, the feature magnitude shows a strong correlation with MOS. (b) Complementary behaviors of the two cues across quality levels: the scatter plots for the SPAQ dataset show that cosine similarity is more reliable in the high-quality region, where semantic features align well with CLIP's pretrained distribution, whereas feature magnitude is more discriminative under low-quality distortions, where semantic alignment breaks down. These observations motivate our dual-cue fusion framework, which adaptively integrates both signals for robust quality prediction.
  • Figure 2: Overview of the Proposed Magnitude-Aware CLIP IQA Framework. Given an input image, we extract its CLIP image embedding and compute two quality signals: (1) $Q_{\text{sim}}$, the image semantic similarity with text prompts, and (2) $Q_{\text{mag}}$, a magnitude-based score obtained via Box-Cox transformation for statistical normalization. To adaptively balance these complementary cues, we compute a confidence discrepancy and generate softmax-based fusion weights, producing the final quality prediction $Q$.
  • Figure 3: Semantic bias in CLIP features. Feature magnitudes for images of visually similar quality differ substantially across semantic categories, so statistical normalization is vital to make magnitude cues reliable. WD denotes the Wasserstein distance between two feature distributions.
  • Figure 4: MOS alignment visualization. Representative examples from multiple datasets illustrating the ranking alignment between MOSs and our MA-CLIP predictions. Each triplet shows three images with their MOSs and the quality scores predicted by CLIP-IQA and by our MA-CLIP.
  • Figure 5: Scatter-plot comparison of CLIP-IQA and MA-CLIP. The x-axis represents the MOS and the y-axis the predicted score. The closer the points lie to the ideal line, the better the model's predictions.
  • ...and 1 more figure