SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment

Mahdi Naseri, Zhou Wang

Abstract

No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.
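
As a rough illustration of what "soft, controllable relations" could look like in code, the sketch below builds a metadata-driven affinity matrix and uses it to weight an embedding-alignment term. Everything here is an assumption made for illustration and not the paper's formulation: the function names, the exponential severity kernel, the temperature `tau`, and the squared-distance loss are all hypothetical. The only ideas taken from the abstract are that samples sharing a distortion pattern should be pulled together and that the strength of the relation should vary smoothly with severity rather than being binary.

```python
import numpy as np

def metadata_relation_graph(types, severities, tau=0.5):
    """Hypothetical soft affinity in [0, 1] derived from distortion metadata.

    Two samples are related only if they share a distortion type; the
    affinity decays smoothly as their severities move apart (assumed kernel).
    """
    types = np.asarray(types)
    sev = np.asarray(severities, dtype=float)
    same_type = types[:, None] == types[None, :]
    gap = np.abs(sev[:, None] - sev[None, :])
    G = same_type * np.exp(-gap / tau)
    np.fill_diagonal(G, 0.0)  # ignore self-relations
    return G

def graph_weighted_alignment(Z, G):
    """Weighted mean squared distance between embeddings of related samples."""
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    total = G.sum()
    return float((G * sq_dist).sum() / total) if total > 0 else 0.0

# Toy batch: two blurred and two noisy samples with assumed severities in [0, 1].
types = ["blur", "blur", "noise", "noise"]
severities = [0.2, 0.4, 0.2, 0.9]
G = metadata_relation_graph(types, severities)

Z = np.random.default_rng(0).normal(size=(4, 8))  # stand-in embeddings
loss = graph_weighted_alignment(Z, G)
```

In this toy batch the two blur samples (severity gap 0.2) receive a larger weight than the two noise samples (gap 0.7), and cross-type pairs receive zero weight. An alignment term of this kind is minimized trivially by collapsing all embeddings to one point, so a non-contrastive method would normally pair it with some anti-collapse regularizer; which one SHAMISA uses is not stated in the abstract.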

Paper Structure (74 sections, 28 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Overview of the proposed SHAMISA framework. (1) Self-supervised pre-training: pristine images are transformed by a compositional distortion engine to form a mini-batch $\mathbf{X}$ and distortion metadata; the encoder $f_{\theta}$ and projector $g_\psi$ produce representations $\mathbf{H}$ and embeddings $\mathbf{Z}$ used to construct dual-source relation graphs (metadata-driven and structurally intrinsic), which are aggregated by a multi-source aggregator $\Phi$ into $\mathbf{G}$ and used to optimize a graph-weighted SSL objective. (2) Downstream NR-IQA: we freeze $f_{\theta}$ and train a lightweight regressor on top of $\mathbf{H}$ to predict quality scores.
  • Figure 2: gMAD on Waterloo Exploration (distorted pool): panels (a,b) use SHAMISA as the defender and panels (c,d) use ARNIQA as the defender. In each panel, the top and bottom images are the attacker-selected pair from the indicated defender-quality bin, with low-bin cases shown in (a,c) and high-bin cases shown in (b,d). Supplementary Table VII reports the corresponding top-1 attacker gaps.
  • Figure 3: t-SNE visualization of image-level encoder representations on KADID-10K. We extract encoder representations $\mathbf{H}$ (pre-projector) and average across crops to obtain one representation per image, yielding 10,125 image-level points. Points are colored by the coarse distortion family using a shared fixed palette. Compared to ARNIQA, SHAMISA exhibits clearer coarse-family structure for several dominant degradation families, while both models show overlap among visually related families such as brightness and color manipulations. This visualization is a qualitative diagnostic and does not affect the quantitative evaluation protocol.
  • Figure 4: Manifold visualization with UMAP [mcinnes2018umap] using encoder representations $\mathbf{H}$ extracted from 1,000 pristine KADIS images degraded with Gaussian blur and white noise. For each pristine image we generate 5 blur-only samples, 5 noise-only samples, and all $5 \times 5$ blur$\rightarrow$noise compositions (35 variants per image, 35k total points). Point color is the weighted average of blur (red) and noise (yellow) according to their relative intensities; marker opacity increases with total severity.
  • Figure 5: SRCC vs. training steps during SSL pre-training (mean with $\pm$ std error bars over 10 splits) across representative NR-IQA benchmarks. Curves typically rise early and then saturate, while the best checkpoint can vary by dataset since the SSL objective is not identical to downstream SRCC.
  • ...and 6 more figures
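
The variant count quoted in the Figure 4 caption can be checked with a short enumeration. This is a minimal sketch: the integer severity levels 1 to 5 are placeholders (the caption does not give the actual blur or noise parameters), and `enumerate_variants` and `blur_fraction` are hypothetical helper names. The second helper mirrors the caption's coloring rule, in which each point is colored by the relative intensity of blur versus noise.

```python
from itertools import product

def enumerate_variants(blur_levels, noise_levels):
    """List distortion chains: blur-only, noise-only, then blur followed by noise."""
    variants = [(("blur", b),) for b in blur_levels]
    variants += [(("noise", n),) for n in noise_levels]
    variants += [
        (("blur", b), ("noise", n))
        for b, n in product(blur_levels, noise_levels)
    ]
    return variants

def blur_fraction(chain):
    """Share of total severity contributed by blur (1.0 = blur only, 0.0 = noise only)."""
    blur = sum(level for name, level in chain if name == "blur")
    total = sum(level for _, level in chain)
    return blur / total if total else 0.0

levels = [1, 2, 3, 4, 5]  # placeholder severity indices, not the paper's parameters
variants = enumerate_variants(levels, levels)

print(len(variants))         # 5 + 5 + 25 = 35 variants per image
print(len(variants) * 1000)  # 35000 points for 1,000 pristine images
```

A composed chain such as blur level 2 followed by noise level 2 gets a blur fraction of 0.5, which under the caption's rule would sit midway between the red and yellow ends of the color scale.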