Revisiting Vision Language Foundations for No-Reference Image Quality Assessment

Ankit Yadav, Ta Duc Huy, Lingqiao Liu

Abstract

Large-scale vision-language pre-training has recently shown promise for no-reference image quality assessment (NR-IQA), yet the relative merits of modern Vision Transformer foundations remain poorly understood. In this work, we present the first systematic evaluation of six prominent pretrained backbones (CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet) for NR-IQA, each finetuned using an identical lightweight MLP head. Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly in the generalization ability of image quality assessment models. Notably, we find that simple sigmoid activations outperform the commonly used ReLU and GELU on several benchmarks. Motivated by this finding, we introduce a learnable activation selection mechanism that adaptively determines the nonlinearity for each channel, eliminating the need for manual activation design and achieving new state-of-the-art SRCC on CLIVE, KADID10K, and AGIQA3K. Extensive ablations confirm the benefits across architectures and regimes, establishing strong, resource-efficient NR-IQA baselines.

Paper Structure

This paper contains 27 sections, 3 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: In this comparison, the sigmoid-activated MLP head yields more precise Grad-CAM localization of natural blur on and around faces, a cue closely tied to perceived image quality, than the alternative variant. Example from CLIVE [ghadiyaram2015massive].
  • Figure 2: Channel-wise distributions of the gate weight $w = \sigma(g)$ learned by the gated activation head at the final epoch, comparing CLIVE (low-data regime) and KonIQ10K (large-data regime). Larger $w$ indicates greater reliance on the sigmoid branch, while smaller $w$ favors the LeakyReLU branch. See supplementary Figure S7 for the epoch-wise distributions.
  • Figure 3: Our Adaptive Gated MLP, in which both activation layers of the network consist of parameterized LeakyReLU and sigmoid branches whose outputs are mixed per channel through a learnable gate $w_{c} = \sigma(g_{c})$. All activation parameters are learned jointly with the linear weights.
  • Figure 4: t-SNE comparison of held-out test data (left) versus training data (right) on the CLIVE dataset. We analyze the feature representations of different configurations: Encoder Only $\rightarrow$ Baseline MLP $\rightarrow$ Sigmoid MLP $\rightarrow$ Param-Gated MLP. Progressively tighter clusters and sharper bucket boundaries indicate the contribution of each module.
  • Figure 5: Grad-CAM of SigLIP2 encoder features, grouped by response magnitude. High-response features align with semantic content, while mid-response features capture subtle artifacts.
  • ...and 11 more figures
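
The captions of Figures 2 and 3 describe a per-channel gate $w_{c} = \sigma(g_{c})$ that mixes a sigmoid branch with a LeakyReLU branch. The forward pass of such a gate might look like the minimal pure-Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the convex mix $w_{c}\,\sigma(x_{c}) + (1 - w_{c})\,\mathrm{LeakyReLU}(x_{c})$ is inferred from the statement that larger $w$ means greater reliance on the sigmoid branch, the only activation parameter modeled is a per-channel LeakyReLU negative slope, and the names `gated_activation`, `gate_logits`, and `slopes` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def leaky_relu(x: float, slope: float) -> float:
    """LeakyReLU with an explicit negative slope."""
    return x if x >= 0.0 else slope * x


def gated_activation(x, gate_logits, slopes):
    """Mix sigmoid and LeakyReLU outputs per channel.

    x           -- pre-activations, one value per channel
    gate_logits -- gate parameters g_c (hypothetical name)
    slopes      -- LeakyReLU negative slopes per channel (assumed
                   parameterization; the source does not specify it)
    """
    if not (len(x) == len(gate_logits) == len(slopes)):
        raise ValueError("all inputs need one entry per channel")
    out = []
    for x_c, g_c, a_c in zip(x, gate_logits, slopes):
        w_c = sigmoid(g_c)  # gate weight in (0, 1)
        # Assumed convex mix: w_c near 1 favors the sigmoid branch,
        # w_c near 0 favors the LeakyReLU branch.
        out.append(w_c * sigmoid(x_c) + (1.0 - w_c) * leaky_relu(x_c, a_c))
    return out


# Three channels: a balanced gate, a sigmoid-dominated gate,
# and a LeakyReLU-dominated gate.
mixed = gated_activation([0.0, 2.0, -2.0], [0.0, 50.0, -50.0], [0.1, 0.1, 0.1])
```

In a trainable version the gate logits and slopes would be learned jointly with the linear weights, as the Figure 3 caption states; that requires an autodiff framework and is left out here.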