RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution

Yushuai Song, Weize Quan, Weining Wang, Jiahui Sun, Jing Liu, Meng Li, Pengbin Yu, Zhentao Chen, Wei Shen, Lunxi Yuan, Dong-ming Yan

Abstract

Recent advances in generative super-resolution (SR) have greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-Reference (FR) and No-Reference (NR) metrics often fail to reflect perceptual preference, either penalizing semantically plausible details due to pixel misalignment or favoring visually sharp but inconsistent artifacts. Moreover, most SR methods rely on ground-truth (GT)-dependent distribution matching, which does not necessarily correspond to human judgments. In this work, we propose RefReward-SR, a low-resolution (LR) reference-aware reward model for preference-aligned SR. Instead of relying on GT supervision or NR evaluation, RefReward-SR assesses high-resolution (HR) reconstructions conditioned on their LR inputs, treating the LR image as a semantic anchor. Leveraging the visual-linguistic priors of a Multimodal Large Language Model (MLLM), it evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for SR, providing pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and further integrate GRPO into SR model training with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that our framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.
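
The GRPO step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary rank reward (1 when the sampled response picks the same HR candidate as the human annotator, 0 otherwise) and the commonly used group-normalized advantage; the function names, the reward definition, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def rank_reward(predicted_winner: str, human_winner: str) -> float:
    """Hypothetical rank reward: 1.0 if the response agrees with the human label."""
    return 1.0 if predicted_winner == human_winner else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of responses sampled for the same (LR, HR A, HR B) comparison.
predictions = ["A", "B", "A", "A"]
rewards = [rank_reward(p, human_winner="A") for p in predictions]
advantages = group_relative_advantages(rewards)
```

Responses that agree with the human ranking receive a positive advantage and the dissenting one a negative advantage; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.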

Paper Structure (26 sections, 9 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Overview of our proposed evaluation paradigm. (a) HR A preserves structures, while HR B shows severe distortions. (b) RefReward-SR uses an MLLM for interpretable, LR-conditioned semantic scoring. (c) Existing metrics wrongly favor the distorted HR B, whereas ours aligns with human perception. (d) RefReward-SR uniquely ensures semantic fidelity and preference alignment without GT dependency. The asterisk (*) indicates that FR metrics evaluate high-level semantics only passively via low-level correspondence.
  • Figure 2: The proposed RefReward-SR framework. (a) MLLM Fine-tuning: Fine-tuning via GRPO with format and LR-conditioned rank rewards from RefSR-18K for human preference alignment. (b) Global-Local Crop Scoring: Extracting representative local crops via RAM and Grounding DINO, then fusing multi-scale MLLM scores via area-weighted averaging for comprehensive evaluation.
  • Figure 3: The result of the user study win rate and average RefReward-SR score.
  • Figure 3: Ablation study of the RefReward-SR evaluator on both in-domain and out-of-domain test sets. Agreement is reported in %.
  • Figure 4: Qualitative comparison of state-of-the-art generative SR methods. Please zoom in for a better view. Additional results are provided in the Suppl. Mater.
  • ...and 7 more figures
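
The area-weighted fusion named in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes the global score is weighted by the full image area and each local crop score by its crop area; the function name, the argument layout, and the inclusion of the global score in the same average are assumptions, and the crop selection via RAM and Grounding DINO is not reproduced here.

```python
def fuse_scores(global_score, image_area, crops):
    """Area-weighted average of one global score and several local crop scores.

    crops is a list of (score, area) pairs; all areas share the same unit (pixels).
    """
    total_weight = image_area + sum(area for _, area in crops)
    weighted_sum = global_score * image_area + sum(s * a for s, a in crops)
    return weighted_sum / total_weight


# Hypothetical scores: a 100-pixel image scored 4.0 with two 50-pixel crops.
fused = fuse_scores(4.0, 100, [(2.0, 50), (3.0, 50)])
```

Under this weighting, larger regions pull the fused score toward their own score, so a small distorted crop lowers the result less than a large one.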