Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment
Yuan Li, Zitang Sun, Yen-Ju Chen, Shin'ya Nishida
TL;DR
This paper asks whether multi-modal large language models (MLLMs) actually perceive the low-level visual distortions that image quality assessment (IQA) depends on. It introduces a unified distortion-perception dataset spanning four IQA corpora and analyzes perception in two stages, visual feature extraction and subsequent reasoning, across the three components of an MLLM (vision encoder, projector, language model). The key finding is that improving the alignment of the vision encoder through fine-tuning raises distortion recognition accuracy from 14.92% to 84.43% and tightens the alignment between visual features and distortion semantics, whereas overfitting to training templates degrades perceptual fidelity. The results suggest that dedicated constraints on the vision encoder are important for robust, interpretable vision-centric reasoning in MLLMs and for coherent explanations in IQA tasks.
Abstract
Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.
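To make the semantic-distance measurement mentioned above more concrete, the following is a minimal sketch of one way such a distance could be computed. The abstract does not specify the metric or the pooling scheme, so cosine distance, mean pooling over visual tokens, the function names, and the toy 3-dimensional embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def semantic_distance(visual_tokens, text_token_embedding):
    """Distance between pooled visual features and one semantic token.

    Assumption: visual features are mean-pooled and compared with cosine
    distance; the paper may use a different pooling scheme or metric.
    """
    return cosine_distance(mean_pool(visual_tokens), text_token_embedding)


# Hypothetical toy embeddings (made up for illustration only).
blur_token = [1.0, 0.0, 0.0]  # embedding of a distortion word such as "blur"
features_before = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # before fine-tuning
features_after = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # after fine-tuning

distance_before = semantic_distance(features_before, blur_token)
distance_after = semantic_distance(features_after, blur_token)
print(distance_before, distance_after)
```

In this toy setup a smaller distance after fine-tuning would correspond to visual features that sit closer to the distortion's text token, which is the kind of before-and-after comparison the abstract describes.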
