Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning
Yiting Lu, Xin Li, Haoning Wu, Bingchen Li, Weisi Lin, Zhibo Chen
TL;DR
The paper tackles Explainable Image Quality Assessment (EIQA) with Large Multi-modal Foundation Models, focusing on the conflict that arises when two explanation tasks, overall quality explanation and attribute-wise perception answering, are tuned jointly. It introduces Q-Adapt, a progressive instruction-tuning framework with two stages: a universal perception knowledge learning stage based on LoRA, and an instruction-guided visual prompting stage that uses a bi-directional V-T Generator and T-V Prompter to dynamically adapt visual features to the task instruction. The result is a lightweight 3B-parameter model with LoRA that achieves competitive or superior results on perceptual benchmarks and widely used IQA datasets while reducing the conflict between the two tasks. The work indicates that structured, stage-wise adaptation with instruction-adaptive visual prompts improves explainable IQA and supports efficient EIQA deployment.
Abstract
The rapid advancement of Large Multi-modal Foundation Models (LMMs) has paved the way for Explainable Image Quality Assessment (EIQA) through instruction tuning from two perspectives: overall quality explanation and attribute-wise perception answering. However, existing works usually overlook the conflicts between these two types of perception explanations during joint instruction tuning, leading to insufficient perception understanding. To mitigate this, we propose a new paradigm for perception-oriented instruction tuning, i.e., Q-Adapt, which aims to eliminate the conflicts and achieve synergy between these two EIQA tasks when adapting LMMs, resulting in enhanced multi-faceted explanations for IQA. In particular, we propose a progressive instruction tuning strategy that divides the adaptation of an LMM for EIQA into two stages: the first stage equips the LMM with universal perception knowledge tailored to the two tasks using an efficient transfer learning strategy, i.e., LoRA, and the second stage introduces instruction-adaptive visual prompt tuning to dynamically adapt visual features to the different instructions from the two tasks. In this way, Q-Adapt yields a lightweight visual quality evaluator that demonstrates comparable and, in some instances, superior performance across perception-related benchmarks and commonly used IQA databases. The source code is publicly available at https://github.com/yeppp27/Q-Adapt.
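
As an illustration of the efficient transfer learning used in the first stage, the following is a minimal sketch of a generic LoRA update, not the Q-Adapt implementation. It assumes a frozen weight matrix w0 and two small trainable matrices; the names down, up, alpha and the helper functions are hypothetical and chosen only for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_forward(x, w0, down, up, alpha):
    """Compute x @ w0 + (alpha / r) * (x @ down) @ up.

    w0 is the frozen pretrained weight (d_in x d_out).
    down (d_in x r) and up (r x d_out) are the only trainable matrices.
    The merged weight is never materialised.
    """
    r = len(up)                      # LoRA rank = number of rows of `up`
    scale = alpha / r
    base = matmul(x, w0)             # frozen path
    delta = matmul(matmul(x, down), up)  # low-rank path
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


def trainable_params(d_in, d_out, r):
    """Number of trainable LoRA parameters for one d_in x d_out layer."""
    return d_in * r + r * d_out


# Toy usage (all values are illustrative assumptions).
x = [[1.0, 2.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0], [1.0]]
up_zero = [[0.0, 0.0]]   # common initialisation: the update starts at zero
up = [[0.5, -0.5]]

y_init = lora_forward(x, w0, down, up_zero, alpha=2.0)
y_tuned = lora_forward(x, w0, down, up, alpha=2.0)
```

With up initialised to zeros the layer reproduces the frozen model exactly, and only d_in * r + r * d_out parameters are updated per layer instead of d_in * d_out, which is why this kind of adaptation is considered lightweight.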
