Diffusion Model Based Visual Compensation Guidance and Visual Difference Analysis for No-Reference Image Quality Assessment

Zhaoyang Wang, Bo Hu, Mingyang Zhang, Jie Li, Leida Li, Maoguo Gong, Xinbo Gao

TL;DR

This work tackles NR-IQA by bridging pixel-level distortions and high-level semantic cues through a diffusion-model-driven framework. It introduces Diff$V^2$IQA, which comprises a diffusion restoration network that yields a restored image and two noise-containing intermediates, plus two evaluation branches: Visual Compensation Guidance (ViT with noise-level embedding) and Visual Difference Analysis (RTAB with ResNet). The approach achieves state-of-the-art results on seven NR-IQA datasets, demonstrates strong cross-dataset generalization, and provides extensive ablations that validate the contribution of each component and input configuration. The method is effective on synthetic distortions, while its performance on authentic distortions is expected to benefit from future improvements in pretraining data, restoration strength, and data augmentation. Overall, Diff$V^2$IQA offers a principled, interpretable NR-IQA framework that uses diffusion-derived high-level features to align closely with human visual self-repair mechanisms.

Abstract

Existing free-energy guided No-Reference Image Quality Assessment (NR-IQA) methods still struggle to balance learning pixel-level feature information with capturing high-level feature information, and making efficient use of the high-level feature information they obtain remains a challenge. As a novel class of state-of-the-art (SOTA) generative models, diffusion models can model intricate relationships, enabling a comprehensive understanding of images and better learning of both high-level and low-level visual features. In view of this, we pioneer the exploration of the diffusion model in the domain of NR-IQA. First, we devise a new diffusion restoration network that leverages the produced enhanced image and noise-containing images, incorporating nonlinear features obtained during the denoising process of the diffusion model, as high-level visual information. Second, two visual evaluation branches are designed to comprehensively analyze the obtained high-level feature information: the visual compensation guidance branch, grounded in the transformer architecture and a noise embedding strategy, and the visual difference analysis branch, built on the ResNet architecture and the residual transposed attention block. Extensive experiments on seven public NR-IQA datasets demonstrate that the proposed model outperforms SOTA methods for NR-IQA.
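
The abstract mentions a noise embedding strategy for the transformer-based branch but does not spell out its form. As a rough illustration only, the sketch below shows a sinusoidal embedding of the kind commonly used to encode diffusion timesteps. The function name `noise_level_embedding`, the dimension `dim`, and the constant `max_period` are assumptions made for this example, not details taken from the paper.

```python
import math


def noise_level_embedding(level: float, dim: int = 8,
                          max_period: float = 10000.0) -> list[float]:
    """Map a scalar noise level (e.g. a diffusion timestep) to a vector.

    Hypothetical helper: a standard sinusoidal encoding, assumed here
    for illustration; the paper's actual embedding may differ.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, decreasing from 1 toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds sines, second half holds cosines.
    sines = [math.sin(level * f) for f in freqs]
    cosines = [math.cos(level * f) for f in freqs]
    return sines + cosines


# Distinct noise levels give distinct vectors, so a network could tell
# a restored image from a noise-containing intermediate.
clean = noise_level_embedding(0.0)
noisy = noise_level_embedding(50.0)
```

In a ViT-style branch, a vector like this could be projected and added to the tokens of each input image so that the model knows how much noise each input carries. That placement is also an assumption here.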

Paper Structure (34 sections, 11 equations, 10 figures, 12 tables, 2 algorithms)

Figures (10)

  • Figure 1: The overview framework of our proposed Diff$V^2$IQA model, which consists of two main components. The first part comprises a diffusion restoration network designed for image enhancement, while the second part involves a two-branch image quality evaluation network.
  • Figure 2: Illustration of the proposed Diffusion model based Visual compensation guidance and Visual difference analysis IQA network (Diff$V^2$IQA). We begin by feeding the input image into a diffusion restoration network, which generates the final restored image along with two intermediate noise-containing images. These images are then processed through two branches: the Visual Compensation Guidance (VCG) branch and the Visual Difference Analysis (VDA) branch. The VCG branch, built on a vision transformer architecture, employs a noise-level embedding strategy to produce a quality score. The VDA branch, based on ResNet50, incorporates an attention mechanism to generate its own score. The outputs from both branches are then combined and analyzed to yield the final quality score.
  • Figure 3: Overview of the pre-training process of the denoising model U-Net in our diffusion restoration network.
  • Figure 4: Comparison of our proposed Residual Transposed Attention Block with the Transposed Attention Block from the MANIQA paper, along with the specific architecture.
  • Figure 5: gMAD competition results between Diff$V^2$IQA and MANIQA. (a) Fixed MANIQA at the low-quality level. (b) Fixed MANIQA at the high-quality level. (c) Fixed Diff$V^2$IQA at the low-quality level. (d) Fixed Diff$V^2$IQA at the high-quality level.
  • ...and 5 more figures
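
The data flow described in the Figure 2 caption can be sketched as follows. This is a toy outline under stated assumptions rather than the authors' implementation: images are plain lists of floats, the denoiser and both branches are stand-in callables, the steps at which intermediates are kept are chosen arbitrarily, and averaging the two scores is only a placeholder because the caption does not say how the branch outputs are combined. All names (`restore`, `assess_quality`, `RestorationOutput`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = List[float]  # toy stand-in for an image tensor


@dataclass
class RestorationOutput:
    restored: Image                   # final restored image
    noisy_intermediates: List[Image]  # noise-containing snapshots


def restore(image: Image,
            denoise_step: Callable[[Image, int], Image],
            num_steps: int,
            keep_steps: Sequence[int]) -> RestorationOutput:
    """Run an iterative denoiser and keep snapshots at the chosen steps."""
    x = list(image)
    kept: List[Image] = []
    for t in range(num_steps, 0, -1):  # t = num_steps, ..., 1
        if t in keep_steps:
            kept.append(list(x))       # state before step t, still noisy
        x = denoise_step(x, t)
    return RestorationOutput(restored=x, noisy_intermediates=kept)


Branch = Callable[[Image, RestorationOutput], float]


def assess_quality(image: Image,
                   restoration: Callable[[Image], RestorationOutput],
                   vcg_branch: Branch,
                   vda_branch: Branch,
                   fuse: Callable[[float, float], float]) -> float:
    """Restore once, score with both branches, then fuse the two scores."""
    out = restoration(image)
    return fuse(vcg_branch(image, out), vda_branch(image, out))


# Toy usage: every component below is a placeholder, not a trained model.
def halve(x: Image, t: int) -> Image:
    return [v * 0.5 for v in x]


def mean_fuse(a: float, b: float) -> float:
    return (a + b) / 2.0  # assumed fusion rule


score = assess_quality(
    [8.0],
    lambda img: restore(img, halve, num_steps=3, keep_steps=(2, 1)),
    vcg_branch=lambda img, out: 0.6,
    vda_branch=lambda img, out: 0.4,
    fuse=mean_fuse,
)
```

Passing the branches and the fusion rule in as callables keeps the sketch neutral about their internals, which the captions above describe only at a high level (a vision transformer with noise-level embedding for VCG, ResNet50 with attention for VDA).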