IRPO: Boosting Image Restoration via Post-training GRPO

Haoxuan Xu; Yi Liu; Boyuan Jiang; Jinlong Peng; Donghao Luo; Xiaobin Hu; Shuicheng Yan; Haoang Li

TL;DR

IRPO tackles the generalization gap and over-smoothing in low-level image restoration (IR) by introducing a GRPO-based post-training framework. It rests on two pillars: data-oriented supervision that focuses on the hardest 30% of samples, and reward-oriented optimization using a composite reward (General, Expert via Qwen-VL, and Task-specific) to align restorations with human perception. The method achieves state-of-the-art results on six in-domain and five out-of-domain (OOD) benchmarks and generalizes well across degradations in an all-in-one setting, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings. Its modular design and all-in-one training setup offer a practical pathway to robust, perceptually faithful IR in real-world scenarios, with code available at the authors’ GitHub.
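The data-oriented idea of keeping only the hardest samples can be illustrated with a minimal sketch. It assumes hardness is measured by the pre-trained model's per-sample PSNR, which the summary above does not confirm; the function name `select_hardest` and the PSNR values are hypothetical.

```python
def select_hardest(scores, fraction=0.3):
    """Return indices of the lowest-scoring `fraction` of samples, hardest first.

    `scores` is assumed to hold one quality score per training sample
    (for example PSNR in dB from the pre-trained model); lower means harder.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one sample so the post-training subset is never empty.
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k]


# Hypothetical per-sample PSNR values (dB), for illustration only.
psnr = [31.2, 24.5, 28.9, 22.1, 35.0, 27.3, 30.4, 26.0, 33.8, 29.7]
hard_ids = select_hardest(psnr, fraction=0.3)
print(hard_ids)
```

Only the selected indices would then be fed to the post-training stage, which is also where the efficiency benefit of a smaller subset would come from.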

Abstract

Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains largely unexplored. Existing image restoration (IR) methods rely on pixel-level hard-fitting to ground-truth images and therefore struggle with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first explore a data formulation principle for the low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we model a reward-level criterion system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
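To make the reward-oriented part concrete, the sketch below combines three reward components into one scalar and converts a group of rewards into group-relative advantages, the normalization step that characterizes GRPO. The weighted sum, the unit weights, the use of the population standard deviation, and all numeric scores are assumptions for illustration; the abstract does not specify how IRPO combines the General, Expert, and Restoration rewards.

```python
from statistics import mean, pstdev


def composite_reward(general, expert, restoration, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward components (weights are hypothetical)."""
    w_g, w_e, w_r = weights
    return w_g * general + w_e * expert + w_r * restoration


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidates: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (general, expert, restoration) scores for four candidate
# restorations of the same degraded input.
group = [
    (0.82, 0.60, 0.75),
    (0.90, 0.70, 0.80),
    (0.78, 0.55, 0.70),
    (0.85, 0.75, 0.78),
]
rewards = [composite_reward(*scores) for scores in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate value network is needed to provide a baseline.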
