Diffusion Model Based Visual Compensation Guidance and Visual Difference Analysis for No-Reference Image Quality Assessment
Zhaoyang Wang, Bo Hu, Mingyang Zhang, Jie Li, Leida Li, Maoguo Gong, Xinbo Gao
TL;DR
This work tackles NR-IQA by bridging pixel-level distortions and high-level semantic cues through a diffusion-model-driven framework. It introduces Diff$V^2$IQA, which comprises a diffusion restoration network that yields a restored image and two noise-containing intermediate images, plus two evaluation branches: Visual Compensation Guidance (a ViT with noise-level embedding) and Visual Difference Analysis (a residual transposed attention block, RTAB, with ResNet). The approach achieves state-of-the-art results on seven NR-IQA datasets, demonstrates strong cross-dataset generalization, and provides extensive ablations that validate the contribution of each component and input configuration. The method is most effective on synthetic distortions; its performance on authentic distortions could benefit from future improvements in pretraining data, restoration strength, and data augmentation. Overall, Diff$V^2$IQA offers a principled, interpretable NR-IQA framework that leverages diffusion-derived high-level features to align closely with human visual self-repair mechanisms.
Abstract
Existing free-energy-guided No-Reference Image Quality Assessment (NR-IQA) methods still struggle to balance learning pixel-level feature information with capturing high-level feature information, and making efficient use of the high-level feature information they obtain remains a challenge. As a novel class of state-of-the-art (SOTA) generative models, diffusion models can capture intricate relationships, which enables a comprehensive understanding of images and better learning of both high-level and low-level visual features. In view of this, we pioneer the exploration of the diffusion model in the domain of NR-IQA. First, we devise a new diffusion restoration network that leverages the produced enhanced image and noise-containing images, which incorporate nonlinear features obtained during the denoising process of the diffusion model, as high-level visual information. Second, two visual evaluation branches are designed to comprehensively analyze the obtained high-level feature information: the visual compensation guidance branch, grounded in the transformer architecture and a noise embedding strategy, and the visual difference analysis branch, built on the ResNet architecture and the residual transposed attention block. Extensive experiments on seven public NR-IQA datasets demonstrate that the proposed model outperforms SOTA methods for NR-IQA.
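The noise embedding strategy is not specified at this level of detail, so the following is only an illustrative sketch. It assumes a sinusoidal encoding of a scalar noise level, which is a common choice in diffusion models but is not confirmed to be the one used here. The function name `noise_level_embedding` and the parameters `dim` and `max_period` are hypothetical.

```python
import math


def noise_level_embedding(t, dim=8, max_period=10000.0):
    """Map a scalar noise level t to a dim-length sinusoidal vector.

    Assumption: a sinusoidal encoding is used purely for illustration;
    the source does not state which embedding the branch applies.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds sines, second half holds cosines.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


# Hypothetical usage: tag each input image with the noise level it carries,
# e.g. 0 for the restored image and larger values for noisier intermediates.
embeddings = {level: noise_level_embedding(level) for level in (0, 5, 10)}
```

In a setup like this, each embedding would be combined with the corresponding image's tokens so that a transformer branch can tell the restored image apart from the noise-containing ones. How the combination is done (addition, concatenation, or otherwise) is left open here.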
