Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment

Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu, Shin'ya Nishida

TL;DR

This work tackles blind image quality assessment (BIQA) by embedding a human-like perception-reasoning cascade into multimodal learning. It introduces the Q-Reasoning dataset, which captures eight perception and reasoning dimensions, and trains a large language model with human-guided reinforcement learning, augmented by a self-consistency objective that requires the model to predict quality from its own captions. The approach achieves competitive image-based quality predictions and significantly improves alignment with human reasoning (measured by ROUGE-1), demonstrating interpretable, human-centered BIQA under both image-conditioned and caption-conditioned settings. The work also proposes caption-based BIQA as a meaningful evaluation dimension, moving BIQA toward interpretable, human-aligned decision making.

Abstract

Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for the baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
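
To make the alignment metric concrete, the following is a minimal sketch of a ROUGE-1 (unigram overlap) computation between a model-generated explanation and a human one. The function name `rouge1_scores`, the lowercase alphanumeric tokenization, and the example sentences are assumptions made for illustration. The abstract does not state which ROUGE-1 variant (precision, recall, or F1) or which tokenizer the authors use, so all three values are returned.

```python
import re
from collections import Counter


def rouge1_scores(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts.

    Tokenization (lowercased alphanumeric runs) is an assumption;
    the paper's exact preprocessing is not given in this summary.
    """
    cand_counts = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical model output and human annotation.
scores = rouge1_scores(
    "the image is blurry and dark",
    "the image is slightly blurry",
)
print(scores)
```

Recall measures how much of the human explanation the model covers, which is the reading suggested by the phrase "coverage of human explanations"; precision instead penalizes extra content in the model's output.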

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Aligning Model Reasoning with Human Judgments in Blind Image Quality Assessment. Left: Comparison between image-conditioned and caption-conditioned quality evaluations. Conventional models (here we test Q-Instruct as the supervised fine-tuning (SFT) model and Q-Insight as the reinforcement learning (RL) model) yield inconsistent scores between image and caption inputs, while our model aligns with human judgments and gives consistent scores across both. Right: Illustration of quality reasoning processes across model types. SFT-based models are supervised on captions and ratings but lack explicit reasoning guidance; existing RL-based models focus on score optimization. Humans reason about image quality through interpretable judgment criteria, enabling consistent assessment with or without direct visual input. Our model is jointly guided on reasoning and rating, mirroring the human evaluation process.
  • Figure 2: Overview of the Q-Reasoning dataset.
  • Figure 3: Overview of the Proposed Human-Like Reasoning Framework. The training process involves two reasoning stages. In the first reasoning stage, the model receives both an image input and a textual prompt. The total reward combines three components: (1) Reasoning reward, measuring the similarity between the model’s generated explanation and human annotations; (2) Prediction reward, aligning the predicted score with human ratings; and (3) Format reward, enforcing structural consistency in the output. In the second reasoning stage, the model takes its previously generated caption and the same prompt as input, and is optimized with a Self-consistency reward. This dual-stage design encourages the first-stage policy to learn human-like perception and quality judgment, while the second-stage reasoning promotes internalization of human-like judgment criteria.
  • Figure 4: Case study on model-human reasoning alignment. We compare the SFT-based Q-Instruct, the RL-based Q-Insight-Score, and our model. Green text indicates reasoning parts consistent with human annotations, while red text highlights inconsistencies. The ROUGE-1 score measures how well the model's reasoning captures human reasoning content.
  • Figure S1: Prompt Template Robustness. We evaluate our base and detailed models under template (b) and template (c). Green highlights indicate perception-reasoning components that remain consistent across templates. Both models successfully adapt to new template patterns and produce stable quality predictions. The detailed model further demonstrates strong human alignment, achieving a ROUGE-1 score of 0.648 under both templates, indicating that its reasoning behavior remains consistent and robust regardless of the prompt structure.
  • ...and 2 more figures
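
The two-stage reward structure described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the unweighted sum, the tolerance-based prediction reward, the `<think>`/`<answer>` output layout, the choice of the human rating as the second-stage target, and all function names are assumptions, since the caption names the reward components without giving their formulas.

```python
import re
from collections import Counter


def reasoning_reward(explanation: str, human_explanation: str) -> float:
    """Assumed form: unigram recall of the human explanation."""
    gen = Counter(re.findall(r"[a-z0-9]+", explanation.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", human_explanation.lower()))
    total = sum(ref.values())
    return sum((gen & ref).values()) / total if total else 0.0


def prediction_reward(predicted: float, human_score: float, tol: float = 0.5) -> float:
    """Assumed form: 1.0 when the score is within `tol` of the human rating."""
    return 1.0 if abs(predicted - human_score) <= tol else 0.0


def format_reward(output: str) -> float:
    """Assumed layout: <think>...</think> followed by <answer>...</answer>."""
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, output, flags=re.DOTALL) else 0.0


def stage_one_reward(explanation: str, human_explanation: str,
                     predicted: float, human_score: float, output: str) -> float:
    """First stage (image + prompt): reasoning + prediction + format.

    An unweighted sum is an assumption; the paper may weight the terms.
    """
    return (reasoning_reward(explanation, human_explanation)
            + prediction_reward(predicted, human_score)
            + format_reward(output))


def self_consistency_reward(caption_score: float, human_score: float) -> float:
    """Second stage (own caption + prompt): reward an accurate score
    predicted from the caption alone. The target is assumed here to be
    the human rating."""
    return prediction_reward(caption_score, human_score)


# Hypothetical rollout.
raw_output = "<think>the image is blurry</think><answer>2.0</answer>"
r1 = stage_one_reward("the image is blurry", "the image is slightly blurry",
                      predicted=2.0, human_score=2.3, output=raw_output)
r2 = self_consistency_reward(caption_score=3.5, human_score=2.3)
print(r1, r2)
```

Keeping the second-stage reward separate from the first makes explicit that the caption-conditioned pass receives no image, so it can only score well if the caption itself carries the quality-relevant information.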