
Teaching LMMs for Image Quality Scoring and Interpreting

Zicheng Zhang, Haoning Wu, Ziheng Jia, Weisi Lin, Guangtao Zhai

TL;DR

This paper introduces Q-SiT, a unified framework that jointly teaches large multimodal models to perform image quality scoring and interpreting. It achieves this by converting traditional IQA data into a QA format for scoring and constructing a large low-level pathway-interpretation dataset to train interpreting, aided by a dynamic data-mix strategy that optimizes the proportion of tasks to mitigate interference. A lightweight variant, Q-SiT-mini, demonstrates significant efficiency gains with competitive performance, enabling practical deployment. Experimental results show strong cross-dataset generalization for scoring and robust interpreting capabilities, validating the proposed balance strategy and the integration of high-level semantic knowledge with low-level perceptual cues. The work lays a foundation for end-to-end, interpretable IQA systems in LMMs and informs future research on joint perceptual and decision-making tasks.

Abstract

Image quality scoring and interpreting are two fundamental components of Image Quality Assessment (IQA). The former quantifies image quality, while the latter enables descriptive question answering about image quality. Traditionally, these two tasks have been addressed independently. However, from the perspective of the Human Visual System (HVS) and the Perception-Decision Integration Model, they are inherently interconnected: interpreting serves as the foundation for scoring, while scoring provides an abstract summary of interpreting. Thus, unifying these capabilities within a single model is both intuitive and logically coherent. In this paper, we propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables large multimodal models (LMMs) to learn both image quality scoring and interpreting simultaneously. We achieve this by transforming conventional IQA datasets into learnable question-answering datasets and incorporating human-annotated quality interpreting data for training. Furthermore, we introduce an efficient scoring & interpreting balance strategy, which first determines the optimal data mix ratio on lightweight LMMs and then maps this ratio to primary LMMs for fine-tuning adjustment. This strategy not only mitigates task interference and enhances cross-task knowledge transfer but also significantly reduces computational costs compared to direct optimization on full-scale LMMs. With this joint learning framework and corresponding training strategy, we develop Q-SiT, the first model capable of simultaneously performing image quality scoring and interpreting tasks, along with its lightweight variant, Q-SiT-mini. Experimental results demonstrate that Q-SiT achieves strong performance on both tasks and generalizes well across IQA datasets. The project page is at https://github.com/Q-Future/Q-SiT.
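The scoring side of this pipeline (MOS labels turned into question-answer pairs for training, and a probability-weighted score at inference, as the Figure 3 caption describes) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the five level names, the equal-width binning, the 1-5 level weights, the question and answer templates, and all function names are assumptions introduced here.

```python
import math

# Assumed five-level rating scale; the names and the 1-5 weights are
# illustrative and are not specified in the text above.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def mos_to_level(mos, mos_min, mos_max):
    """Map a MOS to a text-defined level using equal-width bins (an assumption)."""
    frac = (mos - mos_min) / (mos_max - mos_min)
    idx = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[max(idx, 0)]


def to_qa_sample(image_path, mos, mos_min, mos_max):
    """Turn one (image, MOS) record into a question-answering training sample."""
    level = mos_to_level(mos, mos_min, mos_max)
    return {
        "image": image_path,
        # Hypothetical prompt and answer templates.
        "question": "How would you rate the quality of this image?",
        "answer": f"The quality of the image is {level}.",
    }


def score_from_logits(level_logits):
    """Softmax over the level-token logits, then a probability-weighted mean in [1, 5]."""
    peak = max(level_logits[name] for name in LEVELS)
    exps = [math.exp(level_logits[name] - peak) for name in LEVELS]
    total = sum(exps)
    return sum((i + 1) * e / total for i, e in enumerate(exps))
```

For instance, `to_qa_sample("a.png", 4.8, 1.0, 5.0)` yields an answer ending in "excellent", and a model that puts equal logits on all five levels gets the midpoint score of 3. Predicting a distribution over levels rather than a raw number mirrors the way human raters pick a level and the ratings are then averaged into a MOS.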

Paper Structure

This paper contains 32 sections, 9 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: The Perception-Decision Integration Model for image quality scoring and interpreting tasks recognizes that these two processes are deeply interconnected. While most previous approaches treat them separately with distinct models, interpreting (the perceptual process) and scoring (the decision-making process) are not independent from the perspective of the human visual system (HVS). Instead, they are integral components of a unified evaluation framework.
  • Figure 2: Overview of the proposed Q-SiT framework. The image quality scoring framework (Part1) trains LMMs to evaluate images by assigning probabilities to text-defined quality levels. The image quality interpreting framework (Part2) incorporates large-scale human annotations to inject low-level knowledge into the LMMs. Additionally, we introduce a balance strategy (Part3) to regulate the data ratio, ensuring robust performance across both tasks.
  • Figure 3: This illustration outlines the human annotation process, which typically involves three stages: 1) Training human raters with text-defined rating levels. To simulate this, we propose a rating-level-based syllabus tailored for LMMs. 2) Collecting human ratings. Raters either select a level (Type 1) or adjust a level-guided slider to score (Type 2), without directly entering the score in either method. 3) Converting initial ratings to MOS through a weighted average. In this final stage, we propose a probability-based inference approach for LMMs to predict the final scores.
  • Figure 4: Visualization of curves for predicting the coarse-grained optimal data mix ratio. A fourth-degree polynomial regression model is used for curve fitting. The panel 'D2:D3 Ratio' illustrates the performance trend on LLVisionQA as the ratio changes. The panel '$|\mathrm{D2}^*+\mathrm{D3}^*|$:D1 Ratio' depicts the average performance trend on IQA + LLVisionQA as the ratio varies, where '$|\mathrm{D2}^*+\mathrm{D3}^*|$' denotes the D2 and D3 datasets mixed at the previously predicted optimal ratio.
  • Figure 5: An overview of the performance across a series of image quality LMMs. Q-Instruct is trained only on image quality interpreting data, while Q-Align is trained only on image quality scoring data. Q-SiT and Q-SiT-mini are the proposed models, which are trained on both scoring and interpreting data. Performance on IQA datasets is measured as (SRCC+PLCC)/2. The specific scale is determined by normalizing each performance value against the maximum performance value observed for the corresponding dataset. 'Avg.' represents the model's average performance across all datasets.
  • ...and 2 more figures
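
The curve fitting mentioned in the Figure 4 caption (a fourth-degree polynomial regressed on performance measured at a few probe ratios, with the peak of the fitted curve taken as the coarse optimal ratio) could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's code: the function names, the grid search over the probed range, and the synthetic data in the usage note are all hypothetical, and the least-squares fit is solved with plain normal equations to keep the example dependency-free.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ... + c[d]*x**d."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def predict_best_ratio(ratios, scores, degree=4, grid_points=1001):
    """Fit the curve, then return (ratio, fitted score) at its peak within the probed range."""
    coeffs = polyfit(ratios, scores, degree)
    lo, hi = min(ratios), max(ratios)
    grid = [lo + (hi - lo) * i / (grid_points - 1) for i in range(grid_points)]
    best = max(grid, key=lambda r: polyval(coeffs, r))
    return best, polyval(coeffs, best)
```

As a usage example with made-up numbers, probing ratios `[i / 10 for i in range(11)]` with scores `0.8 - (r - 0.4) ** 2` returns a peak near a ratio of 0.4. A degree-4 fit needs at least five probe points, and restricting the search to the probed range avoids trusting the polynomial where it extrapolates.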