PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

Zehui Feng, Tian Qiu, Tong Wu, Junxuan Li, Huayuan Xu, Ting Han

TL;DR

PreResQ-R1 tackles visual quality assessment by enabling multimodal LLMs to reason about perceptual fidelity through reinforcement learning that couples absolute score regression with relative ranking. It introduces PRPO, a Preference-Response Disentangled Policy Optimization that splits rewards into intra-sample response coherence ($RR$) and inter-sample preference alignment ($PRS$), optimized via Group Relative Policy Optimization (GRPO), and extends to video with a global-temporal/local-spatial data flow and an Exploration-to-Stability fine-tuning strategy. Trained on a modest budget of 6K images and 28K videos, it achieves state-of-the-art SRCC and PLCC across 10 IQA and 5 VQA benchmarks and yields human-aligned chain-of-thought reasoning traces under the total reward $R_{\rm total}$. The approach demonstrates robust cross-domain generalization, interpretable perceptual cues, and a scalable path for efficient alignment of multimodal evaluators in photography, media compression, and AI-generated content assessment.
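To make the GRPO step concrete, the sketch below shows the group-relative advantage that GRPO is generally described as using: the rewards of the K responses sampled for one input are standardized against the group's own mean and standard deviation. The way the two reward branches are merged here (a weighted sum with a hypothetical `weight` parameter), the function names, and the example numbers are assumptions for illustration only; the summary above does not specify how $RR$ and $PRS$ are combined into $R_{\rm total}$.

```python
import math


def total_rewards(response_rewards, preference_rewards, weight=0.5):
    """Merge the two reward branches per response.

    ASSUMPTION: a simple weighted sum; the paper's actual combination
    rule for R_total is not given in this summary.
    """
    return [
        weight * rr + (1.0 - weight) * pr
        for rr, pr in zip(response_rewards, preference_rewards)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each response is scored relative to its siblings, so no separate
    value network is needed to provide a baseline.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative values for K = 4 responses to the same image and prompt.
rr = [0.2, 0.8, 0.5, 0.5]   # hypothetical intra-sample response rewards
pr = [1.0, 0.0, 1.0, 0.0]   # hypothetical inter-sample preference rewards
advantages = group_relative_advantages(total_rewards(rr, pr))
```

Responses whose combined reward is above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. If every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.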

Abstract

Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, surpassing prior methods by margins of 5.30% and 2.15% on the IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.
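The two reported metrics measure different things: PLCC (Pearson linear correlation) rewards predictions that are linearly calibrated to the human scores, while SRCC (Spearman rank correlation) only checks that the ordering agrees. The following is a minimal, dependency-free sketch of both; the function names and the toy data are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import math


def pearson(x, y):
    """PLCC: linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: predictions order the images perfectly but are not
# linearly related to the (hypothetical) human opinion scores.
predicted = [1.0, 2.0, 3.0, 4.0]
human = [1.0, 4.0, 9.0, 16.0]
srcc = spearman(predicted, human)
plcc = pearson(predicted, human)
```

In this toy case the ranking is perfect, so SRCC is at its maximum, while PLCC falls slightly below it because the relationship is monotonic but not linear. This gap is why a rank-only objective can leave score calibration poor, and why reporting both metrics is informative.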

Paper Structure

This paper contains 24 sections, 17 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Performance overview. (a) Existing score/ranking reward functions assign minimal differences, which results in distribution collapse or robustness failure. (b) PreResQ-R1 focuses on fine-grained response-ranking reward balance and preference. (c) PreResQ-R1 enables state-of-the-art performance and stable image quality assessment with a discriminative reward. (d) Typical qualitative and quantitative comparison between VisualQuality-R1 and PreResQ-R1, demonstrating PreResQ-R1's superior performance on image quality description and scoring.
  • Figure 2: Overall training framework of PreResQ-R1 via RL2RS. Given a sample batch $(\mathcal{I}_j, \mathcal{I}_{j+1}, \ldots, \mathcal{I}_{j+B})$ with a shared text prompt $\mathcal{P}$, PreResQ-R1 generates $K$ responses. To quickly activate CoT differences and then attain generation stability, we introduce the response penalty and a fine-grained triplet-response balance reward. To jointly enhance the robustness of ranking and scoring ability, we introduce the preference pairwise-and-triplet score-and-ranking reward for GRPO.
  • Figure 3: Pipeline of the Preference-Response Disentangled Policy Optimization (PRPO), which applies a response-ranking response-balance reward, a preference pairwise score-and-ranking reward, and a preference triplet ranking reward to optimize group policy learning.
  • Figure 4: (a) Qualitative cases of IQA in comparison and ablation studies. (b) Qualitative cases of VQA in comparison studies.
  • Figure 5: Comparison between PreResQ-R1 and VisualQuality-R1 on the distribution of the difference between the predicted answer and the ground truth. The horizontal axis represents the error, and the vertical axis represents the relative proportion. The closer the distribution is to 0, the better the model performs. Blue and orange represent PreResQ-R1 and VisualQuality-R1, respectively.
  • ...and 9 more figures