Multi-Modal Prompt Learning on Blind Image Quality Assessment

Wensheng Pan, Timin Gao, Yan Zhang, Runze Hu, Xiawu Zheng, Enwei Zhang, Yuting Gao, Yutao Liu, Yunhang Shen, Ke Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

TL;DR

This work addresses blind image quality assessment (BIQA) by adapting CLIP through multi-modal prompt learning. MP-IQE introduces a dual-text prompt scheme (scene category and distortion type) and deep visual prompts that are inserted at every layer of the image encoder, coupled with a multimodal encoder for cross-modal interaction. The method achieves competitive or superior SRCC/PLCC across six datasets (e.g., SRCC $0.961$ on CSIQ and $0.941$ on KADID) while maintaining data efficiency and strong cross-dataset generalization. The approach demonstrates that structured, learnable prompts in both text and visual branches enable richer quality semantics and robust IQA performance, highlighting practical potential for CLIP-based BIQA without exhaustive fine-tuning.

Abstract

Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Currently, leveraging semantic information to enhance IQA is a crucial research direction. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. However, the generalist nature of these pre-trained Vision-Language (VL) models often renders them suboptimal for IQA-specific tasks. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. Existing prompt-based VL models focus too heavily on incremental semantic information from text, neglecting the rich insights available from visual data analysis. This imbalance limits their performance improvements in IQA tasks. This paper introduces an innovative multi-modal prompt-based methodology for IQA. Our approach employs carefully crafted prompts that synergistically mine incremental semantic information from both visual and linguistic data. Specifically, in the visual branch, we introduce a multi-layer prompt structure to enhance the VL model's adaptability. In the text branch, we deploy a dual-prompt scheme that steers the model to recognize and differentiate between scene category and distortion type, thereby refining the model's capacity to assess image quality. Our experimental findings underscore the effectiveness of our method over existing Blind Image Quality Assessment (BIQA) approaches. Notably, it demonstrates competitive performance across various datasets. Our method achieves Spearman Rank Correlation Coefficient (SRCC) values of 0.961 (surpassing 0.946 on CSIQ) and 0.941 (exceeding 0.930 on KADID), illustrating its robustness and accuracy in diverse contexts.
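The SRCC and PLCC figures quoted above compare predicted quality scores against human opinion scores: SRCC is the Pearson correlation of the ranks, so it rewards correct ordering, while PLCC measures linear agreement of the raw values. The following is a minimal pure-Python sketch of both metrics, not the authors' evaluation code; the function names and the toy scores are assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical predictions and opinion scores (not taken from the paper).
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
opinion = [10.0, 20.0, 30.0, 40.0, 60.0]
srcc_value = srcc(predicted, opinion)
plcc_value = plcc(predicted, opinion)
```

In this toy case the ordering of the predictions matches the opinion scores exactly, so SRCC is 1, while PLCC is slightly lower because the relationship is not perfectly linear.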

Paper Structure

This paper contains 34 sections, 16 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison between CLIP-IQA, CLIP-IQA$^{+}$ and the proposed method. (a) CLIP-IQA uses an antonym prompt pairing strategy. (b) CLIP-IQA$^{+}$ introduces CoOp to learn suitable prompt pairs. (c) Our approach uses (1) a dual-prompt scheme in the text branch and (2) deep prompts in the visual branch.
  • Figure 2: Overview of our proposed MP-IQE. In the text branch, we introduce the scene prompt and the distortion prompt to align with the class token embedding and the patch embeddings, respectively. In the visual branch, we incorporate a multi-layer prompt to enhance the pre-trained model's adaptability. Finally, the scene embeddings and distortion embeddings are concatenated to form the query of the multi-modal encoder, which performs cross-attention with the image features. The resulting multi-modal features are then sent to an MLP layer to predict the quality score.
  • Figure 3: Comparison of activation maps between the visual-only model and our model using Grad-CAM. Rows 1-3 display, respectively, the original images, CAMs from the visual-only model, and CAMs from our model. The first line of numbers beneath each image gives the ground-truth quality score of the original image. The numbers in lines 2-3 give the predicted quality scores, with the numbers in parentheses indicating the distance between our model and the visual-only model.
  • Figure 4: The t-SNE visualization presents the semantic features of the LIVEC training set using our MP-IQE. Different colors in the visualization represent different quality score intervals.
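
The data flow described in the Figure 2 caption (text embeddings used as queries, cross-attention over image features, then a prediction head) can be sketched in a few lines. This is an illustrative toy under strong assumptions, not the MP-IQE implementation: it uses a single attention head with no learned projections, replaces the MLP with one linear layer, and every name, dimension, and value below is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output channel at a time.
        outputs.append(
            [sum(wi * v[c] for wi, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


def predict_quality(scene_emb, distortion_emb, image_feats, head_w, head_b):
    """Toy pipeline: text queries attend to image features, linear head scores."""
    # Assumption: the two text embeddings act as a two-token query sequence.
    queries = [scene_emb, distortion_emb]
    fused, weights = cross_attention(queries, image_feats, image_feats)
    flat = [v for token in fused for v in token]
    # Assumption: a single linear layer stands in for the MLP.
    score = sum(w * x for w, x in zip(head_w, flat)) + head_b
    return score, weights


# Hypothetical 2-dimensional embeddings and three image feature tokens.
scene = [1.0, 0.0]
distortion = [0.0, 1.0]
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score, attn = predict_quality(scene, distortion, image_tokens, [0.25] * 4, 0.0)
```

Each row of `attn` holds one query's attention distribution over the image tokens, which makes it easy to check which features the scene and distortion queries draw on.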