Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment

Runze Hu, Zihao Huang, Xudong Li, Bohan Fu, Yan Zhang, Sicheng Zhao

TL;DR

CSFIQA tackles the long-standing challenge of scale-dependent quality perception in blind image quality assessment by introducing a triad of mechanisms: Scale Contrastive Learning (SCL) to differentiate inter- and intra-scale quality, Noise Sample Matching (NSM) to emphasize content regions with maximal cross-scale discrepancies, and Selective Focus Attention (SFA) to filter redundant cross-scale information and amplify quality-relevant features. The framework processes image patches at multiple scales through a CrossViT encoder, uses MOS-guided sample selection for contrastive learning, and incorporates a frozen LLM (Llama-7B) as an Information Concentrator to boost salient features. Across seven IQA datasets, CSFIQA achieves state-of-the-art performance on six, with notable improvements on the challenging LIVEFB and LIVEC datasets and strong cross-dataset generalization. The work also provides extensive ablations and cost analyses, showing that selective attention and scale-aware contrastive signals yield better quality estimation with a controlled increase in parameters and computational overhead.

Abstract

Human visual perception naturally evaluates image quality across multiple scales, a hierarchical process that existing blind image quality assessment (BIQA) algorithms struggle to replicate effectively. This limitation stems from a fundamental misunderstanding: current multi-scale approaches fail to recognize that quality perception varies dramatically between scales; what appears degraded when viewed closely may look acceptable from a distance. This inconsistency not only creates misleading "visual illusions" during feature fusion but also introduces substantial redundant information that dilutes quality-critical features and leads to imprecise assessments. Our CSFIQA framework advances multi-scale BIQA via two key innovations: (1) a selective focus attention mechanism that mimics human visual attention by filtering out redundant cross-scale information that would otherwise mask subtle quality indicators, and (2) a scale contrastive learning strategy that explicitly learns to distinguish quality variations both across and within scales. By incorporating an adaptive noise sample matching mechanism, CSFIQA effectively identifies perceptual quality discrepancies in the same content viewed at different scales. Experiments demonstrate substantial improvements over state-of-the-art methods across seven datasets, achieving up to 8.8% SRCC improvement on challenging real-world distortions, confirming CSFIQA's superior alignment with human quality perception.
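
As a rough illustration of the scale contrastive learning idea, the sketch below picks positives and negatives for one anchor by MOS distance, pooling candidates from both scales, and scores them with an InfoNCE-style loss. The thresholds `pos_gap` and `neg_gap`, the temperature `tau`, the helper names, and the InfoNCE form itself are assumptions made for illustration; the abstract does not specify the paper's exact selection rule or loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def select_pairs(samples, anchor, pos_gap=0.1, neg_gap=0.3):
    """Split a mini-batch into positives and negatives for one anchor.

    samples: list of (scale, mos, feature) tuples; scale is "small" or "large".
    A sample is a positive if its MOS is within pos_gap of the anchor's MOS,
    a negative if it differs by at least neg_gap, and is ignored otherwise.
    Both thresholds are hypothetical.
    """
    _, anchor_mos, _ = samples[anchor]
    positives, negatives = [], []
    for j, (_, mos, _) in enumerate(samples):
        if j == anchor:
            continue
        gap = abs(mos - anchor_mos)
        if gap <= pos_gap:
            positives.append(j)
        elif gap >= neg_gap:
            negatives.append(j)
    return positives, negatives

def contrastive_loss(samples, anchor, tau=0.1):
    """InfoNCE-style loss over intra- and inter-scale candidates (assumed form)."""
    positives, negatives = select_pairs(samples, anchor)
    if not positives or not negatives:
        return 0.0  # nothing to contrast against for this anchor
    feat = samples[anchor][2]

    def score(j):
        return math.exp(cosine(feat, samples[j][2]) / tau)

    pos = sum(score(j) for j in positives)
    neg = sum(score(j) for j in negatives)
    return -math.log(pos / (pos + neg))

batch = [
    ("small", 0.80, [1.0, 0.0]),  # anchor
    ("large", 0.85, [0.9, 0.1]),  # similar quality, other scale: positive
    ("small", 0.20, [0.0, 1.0]),  # very different quality, same scale: negative
    ("large", 0.55, [0.5, 0.5]),  # in-between MOS gap: ignored
]
positives, negatives = select_pairs(batch, anchor=0)
loss = contrastive_loss(batch, anchor=0)
```

Because the candidate pool mixes both scales, the same rule yields intra-scale and inter-scale pairs; here the anchor's feature is already close to its positive and orthogonal to its negative, so the loss is near zero.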

Paper Structure

This paper contains 18 sections, 8 equations, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) The same image under different patch perspectives can lead to varying quality judgments, and simply combining information from different viewpoints is prone to causing visual hallucinations. (b-d) Comparison of mainstream multi-scale paradigms with our approach, which uses scale contrastive learning to distinguish quality differences in (a). The designed selective focus attention module can remove redundant semantic information and enhance attention related to perceptual quality.
  • Figure 2: The MOS distribution of LIVEFB at the large scale (40% of the original image size) and the small scale (20% of the original image size).
  • Figure 3: Overview of CSFIQA. For a given image, the $l$-th Transformer encoder layer extracts image features $F^l_a$ ($a \in \{\text{small}, \text{large}\}$) at different scales. These features are then input into the SCL module (see the SCL section), which uses the relative distance of quality labels in the mini-batch to identify inter- and intra-scale positive and negative samples for contrastive learning, enhancing the ability to discriminate quality differences between images. The NSM (also in the SCL section) further increases the distance between quality representations that differ significantly across scales within the same image, thereby distinguishing subtle quality variations between regions and addressing the "visual illusions" effect caused by varying quality information (see Fig. 1(a)). Features from the final encoder layer $F^L_a$ enter the SFA module (see the SFA section), where the Adaptive Filtering Selector (AFS) retains the top-k self-attention similarity scores and the Information Concentrator Module amplifies important details to obtain quality-aware features. These features are then decoded to predict the quality $\hat{Y}$ (see Algorithm 1 for details).
  • Figure 4: Grad-CAM activation maps of DEIQT and CSFIQA on the LIVEC dataset. Scores below the first row indicate the ground-truth MOS. Our model focuses more on distorted regions, leading to predictions closer to the true values. Rows 1-3: input image, baseline CAM, CSFIQA CAM. Rows 4-5: large- and small-scale feature visualizations of CSFIQA.
  • Figure 5: Visualization ablation of the SFA module. The attention map of each layer is obtained with single Grad-CAM before the features enter the module, in order to analyze the roles played by the selection and focusing modules.
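
The Figure 3 caption says the Adaptive Filtering Selector retains the top-k self-attention similarity scores. A minimal sketch of that kind of filtering is given below, assuming single-head scaled dot-product attention in which scores outside the top k are dropped before the softmax. The function name `topk_attention`, the fixed `k`, and the masking-before-softmax choice are assumptions for illustration, not the paper's implementation.

```python
import math

def topk_attention(queries, keys, values, k):
    """Single-head attention that keeps only the k highest scores per query.

    Scores outside the top k are dropped before the softmax, so they get
    exactly zero weight. Inputs are lists of equal-length float vectors.
    """
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        # Indices of the k largest scores; everything else is filtered out.
        keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        # Numerically stable softmax restricted to the retained scores.
        peak = max(scores[j] for j in keep)
        weights = {j: math.exp(scores[j] - peak) for j in keep}
        total = sum(weights.values())
        outputs.append([
            sum(weights[j] / total * values[j][c] for j in keep)
            for c in range(len(values[0]))
        ])
    return outputs

queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
top1 = topk_attention(queries, keys, values, k=1)  # only the best-matching key
top2 = topk_attention(queries, keys, values, k=2)  # blend of the two best keys
```

With `k=1` the output is simply the value attached to the best-matching key; raising `k` lets lower-scoring tokens contribute again, which is the redundancy the selector is described as filtering out.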