Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment

Yuan Li, Zitang Sun, Yen-Ju Chen, Shin'ya Nishida

TL;DR

This paper interrogates whether multi-modal large language models truly perceive low-level visual distortions essential for image quality assessment. It introduces a unified distortion-perception dataset spanning four IQA corpora and analyzes perception through a two-stage lens: visual feature extraction and subsequent reasoning, using a three-component MLLM (vision encoder, projector, LM). The key finding is that fine-tuning the vision path dramatically improves distortion recognition and tightens alignment with distortion semantics, while overfitting to templates during training can degrade perceptual fidelity. The results suggest that constraining the vision encoder is critical for robust, interpretable vision-centric reasoning in MLLMs and for producing coherent explanations in IQA tasks.

Abstract

Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Performance on Visual Perception. The left panel highlights the inconsistency of an IQA model’s perceptions. The right panel compares how the naive and fine-tuned models classify distortion types. Both models are based on the mPLUG-Owl2 architecture [mplug-owl2], while the IQA model used for quality assessment is Q-Instruct [qinstruct]. The evaluated sample is a compressed image from the LIVE dataset [live].
  • Figure 2: MLLM Structure. MLLMs integrate three key components: a vision model, a projector model, and a language model. First, visual inputs are processed by the vision encoder, which extracts visual tokens. These tokens are then projected into a shared semantic space using the projector. In this shared space, both the visual features and the language features are jointly processed by the language model, enabling cross-modality understanding. The entire system can be fine-tuned using language model constraints.
  • Figure 3: Semantic Distance. The visual tokens are transferred into semantic space via a vision encoder and projector. The visual representation can then be inspected directly in two ways. One is to compare the cosine similarity between the transferred visual tokens and the language tokens in semantic space. The other is to measure the probability (logit) of the next token.
  • Figure 4: Conversation Template for Distortion Perception.
  • Figure 5: Accuracy comparison tested on mixed datasets. We present the confusion matrix for distortion classification tasks. The Vision Extractor (representing the visual encoder and projector) plays a crucial role. Fine-tuning the Vision Extractor alone significantly improves distortion perception, indicating that LLM reasoning is not the primary component in this case.
  • ...and 1 more figures
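
The cosine-similarity comparison described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine_similarity`, `mean_pool`, `rank_distortions`), the mean-pooling of visual tokens, and the three-dimensional toy vectors are all assumptions made for illustration. In a real MLLM the visual tokens would come from the vision encoder and projector, and the label vectors from the language model's token embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def mean_pool(tokens: Sequence[Vector]) -> List[float]:
    """Average a sequence of token vectors (an assumed pooling choice)."""
    if not tokens:
        raise ValueError("need at least one token")
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


def rank_distortions(
    visual_tokens: Sequence[Vector],
    label_embeddings: Dict[str, Vector],
) -> List[Tuple[str, float]]:
    """Rank distortion labels by similarity to the pooled visual tokens."""
    pooled = mean_pool(visual_tokens)
    scores = [
        (label, cosine_similarity(pooled, embedding))
        for label, embedding in label_embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): two projected visual tokens and three label vectors.
toy_visual_tokens = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
toy_labels = {
    "blur": [1.0, 0.0, 0.0],
    "noise": [0.0, 1.0, 0.0],
    "compression": [0.0, 0.0, 1.0],
}
ranking = rank_distortions(toy_visual_tokens, toy_labels)
```

Under these toy values the pooled visual vector lies closest to the "blur" label. Comparing such rankings before and after fine-tuning a component is one plausible way to read the semantic-distance analysis the caption describes.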