No-Reference Image Quality Assessment with Global-Local Progressive Integration and Semantic-Aligned Quality Transfer

Xiaoqi Wang; Yun Zhang

TL;DR

This work tackles no-reference image quality assessment by combining a Vision Transformer-based global feature extractor with a CNN-based local feature extractor in a global-local progressive integration framework (GlintIQA). To address data scarcity and content diversity, it introduces the semantic-aligned quality transfer (SAQT) method and the SAQT-IQA dataset, enabling semantically aware label transfer for degraded images. Empirical results show state-of-the-art or competitive performance on authentic and synthetic distortion benchmarks, with significant cross-dataset gains, especially when SAQT pretraining is used. Overall, the paper demonstrates that dual-stream feature fusion plus content-aware data augmentation yields robust NR-IQA with strong generalization across distortion types and content, advancing practical image quality assessment in real-world deployments.

Abstract

Accurate measurement of image quality without reference signals remains a fundamental challenge in low-level visual perception applications. In this paper, we propose a global-local progressive integration model that addresses this challenge through three key contributions: 1) We develop a dual-measurement framework that combines a Vision Transformer (ViT)-based global feature extractor with a convolutional neural network (CNN)-based local feature extractor to comprehensively capture and quantify image distortion characteristics at different granularities. 2) We propose a progressive feature integration scheme that uses multi-scale kernel configurations to align global and local features, and progressively aggregates them via an interactive stack of channel-wise self-attention and spatial interaction modules to form multi-grained quality-aware representations. 3) We introduce a semantic-aligned quality transfer method that extends the training data by automatically labeling the quality scores of diverse image content with subjective opinion scores. Experimental results demonstrate that our model yields 5.04% and 5.40% improvements in Spearman's rank-order correlation coefficient (SROCC) for cross-authentic and cross-synthetic dataset generalization tests, respectively. Furthermore, the proposed semantic-aligned quality transfer yields 2.26% and 13.23% performance gains in evaluations on single-synthetic and cross-synthetic datasets, respectively.
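The abstract reports its gains in SROCC, which scores how well the ordering of predicted quality values matches the ordering of subjective scores. The sketch below shows one common way to compute it: the Pearson correlation of tie-averaged ranks. It is an illustration only, not the authors' evaluation code. The function names (`average_ranks`, `pearson`, `srocc`) and the toy inputs are assumptions introduced here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def srocc(predicted, subjective):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(subjective))


# Hypothetical toy values: predictions that preserve the subjective ordering
# score close to 1 even though the relationship is not linear.
score = srocc([0.1, 0.5, 0.9], [1.0, 4.0, 9.0])
```

Because only ranks matter, SROCC rewards a model for ordering images correctly by quality, regardless of the scale of its raw outputs.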

Paper Structure (31 sections, 11 equations, 10 figures, 9 tables)

Introduction
The proposed semantic-aligned quality transfer method
Statistical Analysis
Dataset Construction
Preliminary Data Preparation
Semantic-Aligned Quality Transfer (SAQT)-IQA dataset
The Proposed GlintIQA
Global and Local Feature Extraction
VGFE
CLFE
Global and Local Feature Integration
Progressive Feature Integration
CWSA
SIEM
Image Quality Prediction
...and 16 more sections

Figures (10)

Figure 1: The correlation between image semantic distance and quality scores is demonstrated under identical distortion conditions across four IQA datasets.
Figure 2: Semantic distance analysis and dataset validation. Upper: PLCC distribution across semantic distances for KADID-10k (mean and standard deviation). Lower: Distribution of semantic distances in the proposed SAQT-IQA dataset.
Figure 3: The procedure of dataset construction based on the proposed semantic-aligned quality transfer method.
Figure 4: The framework of the proposed GlintIQA. The input image is processed by VGFE and CLFE, followed by feature alignment and progressive integration using interactively stacked CWSA and SIEM. The resulting multi-grained representation is then fed to an MLP for quality prediction.
Figure 5: Comparison of spatial dimensionality reduction in ViT and CNNs. (a): In ViT, spatial dimensionality reduction occurs in the patch embedding step, which divides an input image of dimensions $H \times W$ into non-overlapping patches using a convolutional layer with kernel size $k \times k$ and stride $k$ (e.g., $k=16$), resulting in a token grid of size $H/k \times W/k$. (b): In CNNs, spatial dimensionality reduction applies successive convolutional layers with small kernel sizes (e.g., $3 \times 3$) and strides (e.g., 2), halving the spatial dimension at each such layer.
...and 5 more figures
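The contrast described in the Figure 5 caption can be checked with the standard convolution output-size formula. The sketch below is an illustration under stated assumptions, not code from the paper. The helper names, the 224 x 224 input size, and the padding of 1 for the $3 \times 3$ layers are all assumptions introduced here.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def vit_patch_grid(height, width, k=16):
    """Token grid after a single k x k, stride-k patch embedding."""
    return conv_out(height, k, k), conv_out(width, k, k)


def cnn_stage_sizes(height, width, num_stages, kernel=3, stride=2, padding=1):
    """Spatial size after each stride-2 convolution (padding=1 is assumed)."""
    sizes = [(height, width)]
    for _ in range(num_stages):
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
        sizes.append((height, width))
    return sizes


# Assumed 224 x 224 input: the ViT reaches its token grid in one step,
# while the CNN takes four successive halvings to reach the same size.
vit_grid = vit_patch_grid(224, 224)
cnn_sizes = cnn_stage_sizes(224, 224, num_stages=4)
```

Under these assumptions both paths end at a 14 x 14 grid, but the CNN path passes through the intermediate resolutions 112, 56, and 28 along the way.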
