Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models

Zicheng Zhang, Haoning Wu, Zhongpeng Ji, Chunyi Li, Erli Zhang, Wei Sun, Xiaohong Liu, Xiongkuo Min, Fengyu Sun, Shangling Jui, Weisi Lin, Guangtao Zhai

TL;DR

This work addresses the limited use of multi-modality LLMs for visual quality assessment by introducing Q-Boost, a framework that enhances zero-shot IQA and VQA through Triadic-Tone Integration (TTI), which adds neutral prompts, and Multi-Prompt Ensemble (MPE), which uses diverse prompt groups. The approach converts quality evaluation into logit-based scoring around a [SCORE_TOKEN] and combines the per-tone results with softmax to produce a quality score while keeping inference costs low. Experiments with mPO-7B variants show state-of-the-art zero-shot performance on IQA and VQA benchmarks, with ablations indicating that TTI modestly benefits IQA while MPE primarily aids VQA. The results suggest a practical, low-cost path to extend the low-level vision capabilities of MLLMs and encourage further research into text-supervised quality assessment.

Abstract

Recent advancements in Multi-modality Large Language Models (MLLMs) have demonstrated remarkable capabilities in complex high-level vision tasks. However, the exploration of MLLM potential in visual quality assessment, a vital aspect of low-level vision, remains limited. To address this gap, we introduce Q-Boost, a novel strategy designed to enhance low-level MLLMs in image quality assessment (IQA) and video quality assessment (VQA) tasks, which is structured around two pivotal components: 1) Triadic-Tone Integration: Ordinary prompt design simply oscillates between the binary extremes of positive and negative. Q-Boost innovates by incorporating a 'middle ground' approach through neutral prompts, allowing for a more balanced and detailed assessment. 2) Multi-Prompt Ensemble: Multiple quality-centric prompts are used to mitigate bias and obtain a more accurate evaluation. The experimental results show that low-level MLLMs exhibit outstanding zero-shot performance on IQA/VQA tasks when equipped with the Q-Boost strategy.

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Motivation of Q-Boost. The Triadic-Tone Integration strategy helps provide a more balanced and comprehensive assessment while the Multi-Prompt Ensemble strategy helps improve the accuracy and reliability of evaluation.
  • Figure 2: The framework of Q-Boost. The image and prompt are fed into the MLLM, where the log probabilities (logits) are computed between the [SCORE_TOKEN] and the triadic-tone words (including ensembles of multiple prompts). The logits of the different tones are then passed through a softmax operation and integrated into the zero-shot quality score with weighted average pooling.
  • Figure 3: The SRCC (Fig. (a)) and PLCC (Fig. (b)) performance comparison of the best zero-shot competitor, mPO-7B (Q-Instruct), and mPO-7B (Q-Boost), where the index values are calculated as $\frac{SRCC}{SRCC_{max}}$ and $\frac{PLCC}{PLCC_{max}}$. It can be seen that mPO-7B (Q-Boost) achieves the best performance in general and significantly boosts the performance of mPO-7B (Q-Instruct) on the VQA datasets.
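
The scoring pipeline described in the Figure 2 caption (per-tone logits, softmax, weighted average pooling, and an ensemble over prompts) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tone weights (1.0, 0.5, 0.0), the choice to average per-prompt scores with equal weight, and the function names `triadic_score` and `ensemble_score` are all assumptions made for illustration. The logits are made-up numbers standing in for what an MLLM would output at the [SCORE_TOKEN] position.

```python
import math
from typing import Dict, List

# Assumed tone weights for the weighted average pooling step.
# These values are illustrative only; the paper may use different ones.
TONE_WEIGHTS = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def triadic_score(tone_logits: Dict[str, float]) -> float:
    """Map one prompt's positive/neutral/negative logits to a score in [0, 1]."""
    tones = list(TONE_WEIGHTS)
    probs = softmax([tone_logits[t] for t in tones])
    return sum(TONE_WEIGHTS[t] * p for t, p in zip(tones, probs))


def ensemble_score(prompt_logits: List[Dict[str, float]]) -> float:
    """Average the per-prompt scores (assumed equal-weight ensemble)."""
    scores = [triadic_score(logits) for logits in prompt_logits]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical logits for two prompt groups on the same image.
    example = [
        {"positive": 2.0, "neutral": 0.5, "negative": -1.0},
        {"positive": 1.5, "neutral": 1.0, "negative": -0.5},
    ]
    print(round(ensemble_score(example), 3))
```

With these assumed weights, the score rises toward 1 as the positive logit dominates and falls toward 0 as the negative logit dominates, while equal logits across the three tones give the midpoint 0.5.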