Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment

Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang, Tianfan Xue, Jian Zhang

TL;DR

This work investigates why reasoning-based IQA models generalize when trained with reinforcement learning and finds that converting rich visual input into concise quality-descriptive text is the primary generalization mechanism. It introduces two frameworks: Reasoning-Aligned Cross-domain Training (RACT), which leverages cross-domain alignment of image descriptions to improve cross-dataset performance, and Reasoning-Aligned Lightweight IQA (RALI), a three-stage, reasoning-free pipeline that uses contrastive alignment, PCA-based compression, and a basis-vector scoring scheme to closely match RL-based IQA performance with far fewer parameters and much lower latency. Empirical results show RALI achieves competitive PLCC/SRCC with about 0.3B parameters (roughly 4% of a large RL-based model) and orders-of-magnitude faster inference, while delivering strong cross-domain performance and surpassing non-reasoning baselines like CLIP-IQA+. The findings offer a practical path toward efficient, high-quality IQA suitable for on-device deployment and scalable reward modeling in generative systems. The work also highlights broader implications for vision-language tasks by illustrating how concise textual representations can bridge domain gaps and sustain generalization.
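The PLCC/SRCC figures quoted above are the Pearson linear and Spearman rank correlations between predicted scores and human opinion scores. The sketch below shows how they can be computed with the standard library only. The function names are illustrative and not taken from the paper, and the sketch omits the nonlinear score mapping that some IQA evaluations fit before computing PLCC.

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def plcc(pred, mos):
    """Linear agreement between predictions and mean opinion scores."""
    return pearson(pred, mos)

def srcc(pred, mos):
    """Monotonic agreement: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(pred), average_ranks(mos))

# Hypothetical scores, for illustration only.
pred = [1.0, 2.0, 3.0, 4.0]
mos = [1.0, 4.0, 9.0, 16.0]
print(round(plcc(pred, mos), 4), round(srcc(pred, mos), 4))
```

In this toy case the predictions order the images perfectly, so SRCC is 1, while PLCC is slightly lower because the relationship is monotonic but not linear.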

Abstract

Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the mechanisms and critical factors behind this capability remain underexplored. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, which restricts their deployment in certain scenarios. Through extensive experiments, this paper shows that, during RL training, multimodal large language models (MLLMs) learn to use their reasoning capability to convert redundant visual representations into compact, cross-domain aligned text representations. This conversion is precisely the source of the generalization exhibited by reasoning-based IQA models. Building on this insight, we propose a novel algorithm, RALI, which employs contrastive learning to directly align images with the generalizable text representations learned by RL. This approach eliminates the reliance on reasoning processes and even obviates the need to load an LLM. For the quality scoring task, the framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.
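To make the contrastive-alignment idea concrete, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss between image embeddings and embeddings of their quality descriptions. This is an assumption about the general family of losses, not the paper's confirmed objective. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row k of image_embs and row k of text_embs are assumed to be a
    matched pair; every other row in the batch serves as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txts] for img in imgs]
    # Image-to-text: each image should pick out its own description.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same computation on the transposed logits.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: matched pairs point in the same direction.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[2.0, 0.0], [0.0, 3.0]]
texts_swapped = [[0.0, 3.0], [2.0, 0.0]]
print(symmetric_info_nce(images, texts_aligned) < symmetric_info_nce(images, texts_swapped))
```

Minimizing a loss of this kind pulls each image embedding toward the embedding of its own description and away from the other descriptions in the batch. This lets an image encoder alone stand in for the reasoning text at inference time.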

Paper Structure

This paper contains 20 sections, 3 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Performance comparison among IQA methods in PLCC/SRCC and parameter numbers. RALI uses only about 4% of Q-Insight’s (li2025q) parameters while achieving comparable accuracy.
  • Figure 2: Comparison of Reasoning Between Q-Insight and Qwen-VL on the IQA Task. (a) Q-Insight’s reasoning was more concise than Qwen-VL’s and more correlated with image quality. (b) On the KonIQ (hosu2020koniq) test set, CLIP features derived from Q-Insight’s reasoning were more correlated with scores following t-SNE visualization. In this paper, we adopt the quality description between <think> and </think> as the reasoning contents for brevity.
  • Figure 3: Attention heatmap during score-token generation of Q-Insight. It primarily attends to reasoning text tokens instead of visual tokens (95% vs. 5%).
  • Figure 4: t-SNE visualization of visual tokens and reasoning text tokens from SPAQ and KonIQ.
  • Figure 5: Illustration of the proposed reasoning-aligned lightweight IQA (RALI) framework. (a) presents the components and functions of the RL-based IQA model. (b)–(d) jointly constitute our lightweight RALI scheme, including contrastive learning with quality descriptions, feature compression, and score definition. The model’s inference pipeline is identical to (d).
  • ...and 4 more figures
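
The 95% vs. 5% split reported for Figure 3 is a share of attention mass by token type. A minimal sketch of that bookkeeping follows, assuming a single attention vector from the score token over its context and a parallel list of token-type labels. How the paper aggregates over heads and layers is not stated here, and the weights below are made up for illustration.

```python
def attention_share(attn_weights, token_types):
    """Fraction of total attention mass received by each token type."""
    totals = {}
    for weight, kind in zip(attn_weights, token_types):
        totals[kind] = totals.get(kind, 0.0) + weight
    grand_total = sum(totals.values())
    return {kind: mass / grand_total for kind, mass in totals.items()}

# Hypothetical attention from the score token over five context tokens.
weights = [0.01, 0.02, 0.02, 0.45, 0.50]
kinds = ["visual", "visual", "visual", "text", "text"]
print(attention_share(weights, kinds))
```

Dividing by the grand total means the weights do not need to sum to one beforehand, for example when some context tokens have been excluded.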