You Only Train Once: A Unified Framework for Both Full-Reference and No-Reference Image Quality Assessment

Yi Ke Yun, Weisi Lin

TL;DR

The paper presents YOTO, a unified transformer-based framework that jointly handles FR and NR image quality assessment with a single training process. It combines a shared encoder with a Hierarchical Attention adaptor and a Semantic Distortion Aware module to model both spatial distortions and their semantic impact across encoder stages. Empirical results on multiple FR and NR benchmarks, including PIPAL, show state-of-the-art performance for FR and NR IQA, with joint FR/NR training further boosting NR quality estimates while maintaining FR performance. The approach offers improved consistency between FR and NR scores and holds promise for extending to multi-modal IQA scenarios in real-world applications.

Abstract

Although recent efforts in image quality assessment (IQA) have achieved promising performance, there still exists a considerable gap compared to the human visual system (HVS). One significant disparity lies in humans' seamless transition between full reference (FR) and no reference (NR) tasks, whereas existing models are constrained to either FR or NR tasks. This disparity implies the necessity of designing two distinct systems, thereby greatly diminishing the model's versatility. Therefore, our focus lies in unifying FR and NR IQA under a single framework. Specifically, we first employ an encoder to extract multi-level features from input images. Then a Hierarchical Attention (HA) module is proposed as a universal adapter for both FR and NR inputs to model the spatial distortion at each encoder stage. Furthermore, considering that different distortions contaminate encoder stages and damage image semantic meaning differently, a Semantic Distortion Aware (SDA) module is proposed to examine feature correlations between shallow and deep layers of the encoder. By adopting HA and SDA, the proposed network can effectively perform both FR and NR IQA. When our proposed model is independently trained on NR or FR IQA tasks, it outperforms existing models and achieves state-of-the-art performance. Moreover, when trained jointly on NR and FR IQA tasks, it further enhances the performance of NR IQA while achieving on-par performance in the state-of-the-art FR IQA. You only train once to perform both IQA tasks. Code will be released at: https://github.com/BarCodeReader/YOTO.

Paper Structure (14 sections, 10 equations, 9 figures, 8 tables)

This paper contains 14 sections, 10 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Illustration of the main purpose for the proposed network (the lower part in the figure), where different image pairs can be fed into the network to yield FR/NR IQA scores using the same architecture. The traditional framework is given as the upper part of the figure for easy comparison. Our method offers great simplicity in both training and applications, minimizing performance inconsistencies on switching FR/NR tasks, and it achieves state-of-the-art performance on both FR and NR IQA benchmarks.
  • Figure 2: Illustration of how humans assess image quality when distortion is present: which image has the highest quality score? (a) A $100\times100$ yellow block is placed in the background. (b) Four blocks cover both the background and the foreground (the basketball player). (c) A $200\times200$ block is placed in the background. (d) A $100\times100$ block is placed on the face. Ranked by the amount of distortion, $(b)<(a)$ and $(c)<(a)$; ranked by the semantic impact on the content (the player), $(b)<(c)$, and thus $(b)<(c)<(a)$. What about (d)? It contains less distortion than (b) and (c), yet it should receive the lowest quality score because the most important content, the face, is damaged. Thus, beyond its amount, a distortion is significant if it critically damages an image's semantic meaning.
  • Figure 3: Performance comparison against other FR and NR IQA models on the LIVE dataset. Our method achieves state-of-the-art performance on both FR and NR IQA benchmarks using the same network architecture.
  • Figure 4: Network architecture of the proposed YOTO. Input types can be chosen by the user depending on the presence of reference images for the network to perform FR or NR IQA tasks (i.e. the dotted red arrow will become green if the user chooses to perform NR IQA). The network receives a pair of images in the form of $[distorted, distorted]$ or $[distorted, reference]$ for NR and FR IQA respectively. ResNet50 or Swin Transformer is adopted as the encoder backbone (purple). A Hierarchical Attention (HA) module with global and regional attention is developed to highlight potential distortion-contaminated areas in encoder features. If only distorted images are provided, self-attention is applied in the HA module. If reference images are given, the HA module will compute the cross-attention between distorted and reference features. To model the semantic impact caused by distortion, a Semantic Distortion Aware (SDA) module is designed and densely applied to explore the similarity between shallow and deep features using only distorted image features. The obtained features from HA and SDA are concatenated and fused for IQA score estimation via a commonly applied patch-wise attention. Best viewed in color.
  • Figure 5: The developed Hierarchical Attention (HA) module, which bridges the FR and NR tasks. Segmentation embeddings are added to the input features to indicate whether the task is NR or FR. Depending on the task, the input pair shown above is $\{distorted, reference\}$ for FR or $\{distorted, distorted\}$ for NR. In addition to global attention, the attention matrix is partitioned into patches to incorporate local attention. In practice, the HA module stacks multiple attention layers with different scale factors to fully exploit global and local attention.
  • ...and 4 more figures
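
The input pairing described in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_pair`, `attention`, `ha_attention`) are hypothetical, features are plain lists of token vectors, and the learned projections, segmentation embeddings, and patch-wise local attention of the HA module are all omitted. It only shows how one attention routine acts as self-attention for the NR pair [distorted, distorted] and as cross-attention for the FR pair [distorted, reference].

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query_tokens, context_tokens):
    """Scaled dot-product attention without learned projections.

    Queries come from query_tokens; keys and values both come from
    context_tokens (a simplifying assumption for this sketch).
    """
    dim = len(query_tokens[0])
    output = []
    for q in query_tokens:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in context_tokens
        ]
        weights = softmax(scores)
        output.append([
            sum(w * v[d] for w, v in zip(weights, context_tokens))
            for d in range(dim)
        ])
    return output


def make_pair(distorted, reference=None):
    """Build the input pair: [distorted, distorted] for NR, [distorted, reference] for FR."""
    return (distorted, distorted if reference is None else reference)


def ha_attention(distorted, reference=None):
    """Self-attention when no reference is given, cross-attention otherwise."""
    first, second = make_pair(distorted, reference)
    return attention(first, second)


# Toy features: two tokens with two channels each (made-up values).
distorted_feats = [[1.0, 0.0], [0.0, 1.0]]
reference_feats = [[2.0, 2.0], [2.0, 2.0]]

nr_out = ha_attention(distorted_feats)                   # NR: self-attention
fr_out = ha_attention(distorted_feats, reference_feats)  # FR: cross-attention
```

Because the only difference between the two modes is the second element of the pair, the same weights can in principle serve both tasks, which is the property the captions attribute to the HA module.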