VisualCritic: Making LMMs Perceive Visual Quality Like Humans

Zhipeng Huang, Zhizheng Zhang, Yiting Lu, Zheng-Jun Zha, Zhibo Chen, Baining Guo

TL;DR

VisualCritic targets a gap in large multimodal models (LMMs): they struggle to perceive low-level visual quality. It is presented as the first LMM for broad-spectrum subjective image quality assessment. The model combines a frozen vision backbone (EVA), a cross-modality adapter, and a LoRA-tuned Vicuna-13B decoder, and is trained with a three-stage curriculum that emphasizes relativity learning to overcome annotation incongruity and achieve robust cross-dataset generalization. It outputs MOS-based quantitative metrics, qualitative quality descriptions, and authenticity judgments (photographic versus AI-generated) across diverse datasets, achieving state-of-the-art cross-dataset correlation with human judgments while remaining usable out of the box. As a generalist visual quality assessor, it can serve as a versatile tool for visual quality alignment and potentially as a reward model for AIGC, with explanatory outputs and uncertainty awareness to support practical deployment.
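One way to picture relativity learning is as training on "which image is better?" comparisons rather than on raw scores. The sketch below is an illustration under stated assumptions, not the authors' data pipeline: it assumes that annotation incongruity means MOS scales differ between datasets, and it therefore builds comparison pairs only within a single dataset. The function name `make_relative_pairs`, the record layout, and the sample values are all hypothetical.

```python
from itertools import combinations


def make_relative_pairs(samples, min_gap=0.0):
    """Build within-dataset comparison pairs from (image_id, dataset, mos) tuples.

    Assumption: MOS values are only comparable inside one dataset, so no pair
    ever mixes images from two different datasets.
    """
    by_dataset = {}
    for image_id, dataset, mos in samples:
        by_dataset.setdefault(dataset, []).append((image_id, mos))

    pairs = []
    for dataset, items in by_dataset.items():
        for (id_a, mos_a), (id_b, mos_b) in combinations(items, 2):
            # Skip pairs whose scores are too close to give a reliable label.
            if abs(mos_a - mos_b) <= min_gap:
                continue
            better = id_a if mos_a > mos_b else id_b
            pairs.append(
                {"dataset": dataset, "image_a": id_a, "image_b": id_b, "better": better}
            )
    return pairs


# Hypothetical samples: the two datasets use different MOS ranges.
samples = [
    ("a1", "dataset_A", 72.0),  # e.g. a 0-100 scale
    ("a2", "dataset_A", 35.0),
    ("b1", "dataset_B", 4.1),   # e.g. a 1-5 scale
    ("b2", "dataset_B", 2.3),
]
pairs = make_relative_pairs(samples)
for pair in pairs:
    print(pair)
```

Because each label depends only on the ordering inside one dataset, the differing score ranges never have to be reconciled, which is the intuition behind using relative judgments before absolute ones.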

Abstract

Large multimodal models (LMMs) have exhibited impressive generalization capabilities in understanding and generating visual signals. However, they still lack the capability to perceive low-level visual quality in a way that matches human perception. Can LMMs achieve this and show the same degree of generalization in this regard? If so, the versatility of LMMs could be further enhanced, and the long-standing problem of poor cross-dataset performance in visual quality assessment could be addressed. In this paper, we explore this question and answer it in the affirmative. As the result of this initial exploration, we present VisualCritic, the first LMM for broad-spectrum image subjective quality assessment. VisualCritic can be used across diverse data right out of the box, without the dataset-specific adaptation that conventional specialist models require. As an instruction-following LMM, VisualCritic enables new capabilities: (1) quantitatively measuring the perceptual quality of given images in terms of their Mean Opinion Score (MOS), noisiness, colorfulness, sharpness, and other numerical indicators; (2) qualitatively evaluating visual quality and providing explainable descriptions; and (3) discerning whether a given image is AI-generated or photographic. Extensive experiments demonstrate the efficacy of VisualCritic by comparing it with other open-source LMMs and conventional specialist models on both AI-generated and photographic images.
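Agreement between predicted quality scores and human MOS is commonly summarized with a rank correlation. The sketch below computes Spearman's rank correlation in plain Python as one such measure; this section does not say which correlation metrics the paper reports, so the choice of metric, the function names, and the sample numbers are illustrative assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical model scores and human MOS for four images.
predicted = [0.9, 0.4, 0.7, 0.1]
human_mos = [4.5, 2.0, 3.9, 1.2]
print(spearman(predicted, human_mos))
```

A rank-based measure depends only on ordering, so it can be computed even when the model's output range and a dataset's MOS range differ.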

Paper Structure

This paper contains 22 sections, 1 equation, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Comparison between existing LMMs and our proposed VisualCritic for visual subjective quality assessment from the perspectives of quantitative measurement, qualitative evaluation, and authenticity detection. The results show that VisualCritic is the best of its kind at performing consistently well across these tasks.
  • Figure 2: The framework of our proposed VisualCritic, which comprises a frozen vision encoder, a learned cross-modality adapter, and a LoRA-tuned LLM decoder. VisualCritic is the first of its kind to support diverse visual quality assessment tasks, including relative quality comparison, quantitative measurement, qualitative evaluation, and authenticity detection.
  • Figure 3: LLM prompts for data construction. These prompts from left to right are for generating VisualCritic's response (i.e., $\{answer\_content\}$) in the training data of relativity learning, quantitative measurement, qualitative evaluation and authenticity detection, respectively.
  • Figure 4: LLM prompts for data construction, i.e., generating the $\{instruction\_content\}$ in the training data.
  • Figure 5: Further comparison results between VisualCritic and other LMMs on qualitative evaluation. Errors are highlighted in red.
  • ...and 2 more figures