Teaching LMMs for Image Quality Scoring and Interpreting
Zicheng Zhang, Haoning Wu, Ziheng Jia, Weisi Lin, Guangtao Zhai
TL;DR
This paper introduces Q-SiT, a unified framework that jointly teaches large multimodal models to perform image quality scoring and interpreting. It achieves this by converting traditional IQA data into a QA format for scoring and constructing a large low-level pathway-interpretation dataset to train interpreting, aided by a dynamic data-mix strategy that optimizes the proportion of tasks to mitigate interference. A lightweight variant, Q-SiT-mini, demonstrates significant efficiency gains with competitive performance, enabling practical deployment. Experimental results show strong cross-dataset generalization for scoring and robust interpreting capabilities, validating the proposed balance strategy and the integration of high-level semantic knowledge with low-level perceptual cues. The work lays a foundation for end-to-end, interpretable IQA systems in LMMs and informs future research on joint perceptual and decision-making tasks.
Abstract
Image quality scoring and interpreting are two fundamental components of Image Quality Assessment (IQA). The former quantifies image quality, while the latter enables descriptive question answering about image quality. Traditionally, these two tasks have been addressed independently. However, from the perspective of the Human Visual System (HVS) and the Perception-Decision Integration Model, they are inherently interconnected: interpreting serves as the foundation for scoring, while scoring provides an abstract summary of interpreting. Thus, unifying these capabilities within a single model is both intuitive and logically coherent. In this paper, we propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables large multimodal models (LMMs) to learn both image quality scoring and interpreting simultaneously. We achieve this by transforming conventional IQA datasets into learnable question-answering datasets and incorporating human-annotated quality interpreting data for training. Furthermore, we introduce an efficient scoring & interpreting balance strategy, which first determines the optimal data mix ratio on lightweight LMMs and then maps this ratio to primary LMMs for fine-tuning adjustment. This strategy not only mitigates task interference and enhances cross-task knowledge transfer but also significantly reduces computational costs compared to direct optimization on full-scale LMMs. With this joint learning framework and corresponding training strategy, we develop Q-SiT, the first model capable of simultaneously performing image quality scoring and interpreting tasks, along with its lightweight variant, Q-SiT-mini. Experimental results demonstrate that Q-SiT achieves strong performance in both tasks with superior generalization abilities in IQA. Project page at https://github.com/Q-Future/Q-SiT.
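As an illustration of what transforming a conventional IQA dataset into a question-answering dataset might look like, the following is a minimal sketch and not the authors' implementation. It assumes each record is an (image name, mean opinion score) pair and that scores are discretized into five text levels; the level names, the question wording, the answer template, and the helper names `mos_to_level` and `to_qa` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Hypothetical five-level rating vocabulary (an assumption for this sketch;
# the abstract does not specify the answer format).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


@dataclass
class QAPair:
    image: str
    question: str
    answer: str


def mos_to_level(mos: float, mos_min: float, mos_max: float) -> str:
    """Map a mean opinion score to one of len(LEVELS) equal-width bins."""
    if mos_max <= mos_min:
        raise ValueError("mos_max must be greater than mos_min")
    # Normalize to [0, 1] and clamp scores that fall outside the stated range.
    frac = (mos - mos_min) / (mos_max - mos_min)
    frac = min(max(frac, 0.0), 1.0)
    # The top edge (frac == 1.0) belongs to the highest bin.
    idx = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]


def to_qa(records, mos_min: float, mos_max: float) -> list[QAPair]:
    """Turn (image, MOS) records into question-answer pairs."""
    question = "How would you rate the quality of this image?"  # hypothetical prompt
    return [
        QAPair(
            image=name,
            question=question,
            answer=f"The quality of this image is {mos_to_level(mos, mos_min, mos_max)}.",
        )
        for name, mos in records
    ]


if __name__ == "__main__":
    demo = [("a.png", 1.2), ("b.png", 3.0), ("c.png", 4.9)]
    for pair in to_qa(demo, mos_min=1.0, mos_max=5.0):
        print(pair.image, "->", pair.answer)
```

Equal-width binning is only one possible discretization; the point of the sketch is that a numeric label becomes a text answer an LMM can be trained on alongside interpreting data.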
