ExIQA: Explainable Image Quality Assessment Using Distortion Attributes

Sepehr Kazemi Ranjbar, Emad Fatemizadeh

TL;DR

This paper proposes an explainable, attribute-learning approach to distortion identification for blind image quality assessment. It reports state-of-the-art (SOTA) PLCC and SRCC on multiple datasets, and its zero-shot results indicate that the approach generalizes.

Abstract

Blind Image Quality Assessment (BIQA) aims to develop methods that estimate the quality scores of images in the absence of a reference image. In this paper, we approach BIQA from a distortion identification perspective, where our primary goal is to predict distortion types and strengths using Vision-Language Models (VLMs), such as CLIP, due to their extensive knowledge and generalizability. Based on these predicted distortions, we then estimate the quality score of the image. To achieve this, we propose an explainable approach for distortion identification based on attribute learning. Instead of prompting VLMs with the names of distortions, we prompt them with the attributes or effects of distortions and aggregate this information to infer the distortion strength. Additionally, we consider multiple distortions per image, making our method more scalable. To support this, we generate a dataset consisting of 100,000 images for efficient training. Finally, attribute probabilities are retrieved and fed into a regressor to predict the image quality score. The results show that our approach, besides its explainability and transparency, achieves state-of-the-art (SOTA) performance across multiple datasets in both PLCC and SRCC metrics. Moreover, the zero-shot results demonstrate the generalizability of the proposed approach.
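
The abstract reports results in PLCC (Pearson linear correlation coefficient) and SRCC (Spearman rank correlation coefficient) between predicted and ground-truth quality scores. The following is a minimal pure-Python sketch of the two metrics, not the authors' evaluation code: the function names and the toy scores are illustrative assumptions, ties are resolved by average rank, and any nonlinear mapping that some IQA evaluations apply before computing PLCC is omitted.

```python
import math


def pearson(x, y):
    """Pearson linear correlation (PLCC) between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up): predictions are monotonic in the ground-truth scores
# but not linearly related to them.
predicted = [0.2, 0.4, 0.5, 0.9]
ground_truth = [1.0, 2.0, 2.5, 10.0]

plcc = pearson(predicted, ground_truth)
srcc = spearman(predicted, ground_truth)
```

On this toy data SRCC is 1 up to floating-point rounding, because the ordering is preserved exactly, while PLCC is below 1 because the relationship is not linear. In practice, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` compute the same quantities.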

Paper Structure (26 sections, 8 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Use of LLMs, such as GPT-3, to generate the visual effects/attributes of distortions. Each distortion, together with an explanation, is fed to the LLM, which outputs $K$ visual attributes for that distortion.
  • Figure 2: The architecture of our model. For visual simplicity, we consider only two distortions: Impulse Noise and Gaussian Blur. The example image contains both, with strengths of 0.6 and 0.8. The reference image is shown only for comparison with the distorted image. The pipeline is as follows: 1) The distorted image is encoded. 2) Positive and negative prompts are generated for all attributes (two distortions, each with $K$ attributes), and the probability of each attribute is computed. For each distortion, its $K$ attribute probabilities are then averaged to obtain the distortion probability. 3) The attribute probabilities are retrieved and fed to the regressor network, which predicts the quality score.
  • Figure 3: Saliency Maps.
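
Step 2 of the Figure 2 pipeline can be sketched as follows. This is a hedged illustration, not the paper's implementation: the caption does not say how an attribute probability is derived from its positive and negative prompts, so the two-way softmax over scaled cosine similarities used here is an assumption, as are the `scale` value, the function names, and the toy 2-D vectors that stand in for CLIP image and text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute_probability(image_emb, pos_emb, neg_emb, scale=100.0):
    """Probability that the attribute is present in the image.

    ASSUMPTION: a softmax over the (positive, negative) prompt similarities.
    A two-way softmax equals a sigmoid of the logit difference.
    """
    logit_pos = scale * cosine(image_emb, pos_emb)
    logit_neg = scale * cosine(image_emb, neg_emb)
    return 1.0 / (1.0 + math.exp(logit_neg - logit_pos))


def distortion_probability(image_emb, attribute_pairs):
    """Average the K attribute probabilities of one distortion.

    attribute_pairs holds K (positive_embedding, negative_embedding) tuples.
    Returns the averaged distortion probability and the K attribute values.
    """
    probs = [attribute_probability(image_emb, p, n) for p, n in attribute_pairs]
    return sum(probs) / len(probs), probs


# Toy stand-ins for encoder outputs (hypothetical values, K = 2).
image_emb = [1.0, 0.0]
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),  # image matches the positive prompt
    ([0.0, 1.0], [1.0, 0.0]),  # image matches the negative prompt
]
dist_prob, attr_probs = distortion_probability(image_emb, pairs)
```

With these toy vectors the first attribute probability is close to 1, the second is close to 0, and their average is about 0.5. Following the caption, the individual attribute probabilities, not only the per-distortion averages, would then form the input to the regressor in step 3.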

Theorems & Definitions (2)

  • Definition 1: Sub-Problem 1
  • Definition 2: Sub-Problem 2