One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment

Wen Yin, Cencen Liu, Dingrui Liu, Bing Su, Yuan-Fang Li, Tao He

Abstract

Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task's nature. TATAR combines three components: fast-slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at https://github.com/yinwen2019/TATAR.

Paper Structure (15 sections, 10 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: From shared to task-conditioned post-training. Prior unified methods apply the same response behavior and reward to both IQA and IAA, conflating two tasks with fundamentally different decision regimes. TATAR decouples what is shared (the VL backbone) from what is adapted (post-training): Stage 1 instills task-specific reasoning modes via fast-slow SFT, and Stage 2 refines scoring with asymmetric rewards matched to each task's supervision geometry.
  • Figure 2: Motivation for task-conditioned post-training. (a) Adding a CoT instruction yields asymmetric effects: it consistently helps IAA but hurts IQA, demonstrating that a uniform reasoning strategy is suboptimal. (b) Under shared-reward RFT, IQA converges to short, stable completions while IAA produces long, variable ones, revealing that the two tasks inhabit different response and optimization regimes.
  • Figure 3: Task-conditioned reasoning construction. IQA rationales are synthesized by score-conditioned reverse inference from image-score pairs, while IAA rationales are obtained by summarizing structured aesthetic annotations. A unified judge then filters low-quality generations, producing the final fast-slow reasoning corpus used in Stage 1 SFT.
  • Figure 4: Overview of TATAR. TATAR is a unified framework for IQA and IAA with task-conditioned reasoning and reward design. In Stage 1, format-oriented SFT aligns the model to a shared <think>/<answer> output schema while teaching distinct response modes, with concise reasoning for IQA and more deliberative reasoning for IAA. In Stage 2, task-conditioned GRPO further refines the model with a shared optimization procedure but asymmetric rewards, where IQA is optimized with a Gaussian score reward $\mathcal{R}_{\mathrm{fmt}}+\mathcal{R}_{\mathrm{score}}$ and IAA is optimized with a ranking reward $\mathcal{R}_{\mathrm{fmt}}+\mathcal{R}_{\mathrm{rank}}$.
  • Figure 5: Two examples from the constructed reasoning corpus (QACoT-score dataset). IQA (top) samples use concise, distortion-focused rationales, while IAA (bottom) samples use longer and more integrative aesthetic rationales.
  • ...and 2 more figures
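
The asymmetric reward design described in the abstract and the Figure 4 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the exact reward formulas are not given in this excerpt, so the regular expression for the <think>/<answer> schema, the function names, the `sigma` values, and the pairwise form of the Thurstone-style reward (the paper's "completion ranking" may be defined differently) are all assumptions made for illustration.

```python
import math
import re

# Hypothetical pattern for the shared <think>/<answer> schema named in Figure 4.
_SCHEMA = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def parse_score(completion):
    """Return the numeric answer, or None if the schema or number is malformed."""
    match = _SCHEMA.fullmatch(completion)
    if match is None:
        return None
    try:
        value = float(match.group(2).strip())
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def format_reward(completion):
    """R_fmt (assumed binary): 1 if the completion follows the schema, else 0."""
    return 1.0 if parse_score(completion) is not None else 0.0


def gaussian_score_reward(pred, target, sigma=0.5):
    """R_score (assumed form): 1 at an exact match, decaying with squared error.

    sigma is a hypothetical tolerance, not a value taken from the paper.
    """
    return math.exp(-((pred - target) ** 2) / (2.0 * sigma ** 2))


def normal_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def thurstone_rank_reward(pred_a, pred_b, target_a, target_b, sigma=1.0):
    """R_rank (assumed pairwise form): probability mass on the correct ordering.

    Under a Thurstone Case V model with equal variances, the probability that
    item A is preferred to item B is Phi((pred_a - pred_b) / (sqrt(2) * sigma)).
    """
    p_a_wins = normal_cdf((pred_a - pred_b) / (math.sqrt(2.0) * sigma))
    if target_a > target_b:
        return p_a_wins
    if target_a < target_b:
        return 1.0 - p_a_wins
    # Tied ground truth: reward predictions that are also close to tied.
    return 1.0 - abs(2.0 * p_a_wins - 1.0)
```

In this sketch an IQA completion would receive `format_reward(c) + gaussian_score_reward(...)`, which rewards closeness to the absolute score, while an IAA completion would receive `format_reward(c) + thurstone_rank_reward(...)`, which depends only on the difference between two predicted scores and therefore rewards correct relative ordering rather than point-wise regression.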