IRPO: Boosting Image Restoration via Post-training GRPO

Haoxuan Xu, Yi Liu, Boyuan Jiang, Jinlong Peng, Donghao Luo, Xiaobin Hu, Shuicheng Yan, Haoang Li

TL;DR

IRPO tackles the generalization gap and over-smoothing in low-level image restoration (IR) by introducing a GRPO-based post-training framework. It rests on two pillars: data-oriented supervision, which post-trains on the 30% of samples where the pre-trained model performs worst, and reward-oriented optimization, which uses a composite reward (General, Expert via Qwen-VL, and Task-specific) to align restorations with human perception. The method achieves state-of-the-art results on six in-domain and five out-of-domain (OOD) benchmarks and generalizes well across degradations in the all-in-one setting, surpassing the AdaIR baseline by 0.83 dB PSNR in-domain and 3.43 dB OOD. Its modular design and all-in-one training setup offer a practical pathway to robust, perceptually faithful IR in real-world scenarios, with code available at the authors’ GitHub.
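The data-oriented pillar can be illustrated with a small sketch. It assumes that "underperforming" is measured by the pre-trained model's per-sample PSNR, which this summary does not state explicitly. The function names `psnr` and `select_hard_subset` and the example scores are hypothetical and are not taken from the authors' code.

```python
import math

def psnr(restored, target, max_val=1.0):
    """PSNR in dB between two equal-length flat pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(restored, target)) / len(target)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def select_hard_subset(scores, fraction=0.3):
    """Return indices of the lowest-scoring `fraction` of samples.

    scores[i] is assumed to be the pre-trained model's PSNR on sample i,
    so a low score means the sample is hard for that model.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:k]

# Hypothetical per-sample PSNR values from a pre-trained restoration model.
scores = [31.2, 24.5, 28.9, 35.0, 22.1, 30.4, 27.3, 33.8, 26.0, 29.5]
hard_idx = select_hard_subset(scores, fraction=0.3)
print(hard_idx)  # [4, 1, 8]
```

Only the selected indices would be used for post-training; the remaining 70% of samples, which the pre-trained model already restores well, are left out.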

Abstract

Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains largely unexplored. Existing image restoration (IR) methods rely on pixel-level hard-fitting to ground-truth images and struggle with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first explore a data formulation principle for the low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we model a reward-level criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB in OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
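To make the reward-oriented part concrete, the sketch below combines three reward components into one scalar and then applies the group-relative normalization that GRPO uses to turn rewards into advantages. The weighted-sum form, the equal weights, and the candidate scores are assumptions for illustration; this summary does not give the authors' combination rule. In practice the Expert score would come from a Qwen-VL query, which is replaced here by plain numbers.

```python
from statistics import mean, pstdev

def composite_reward(general, expert, restoration, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward components (weights are assumed)."""
    w_g, w_e, w_r = weights
    return w_g * general + w_e * expert + w_r * restoration

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical (general, expert, restoration) scores for four candidate
# restorations of the same degraded input, each already scaled to [0, 1].
candidates = [
    (0.80, 0.60, 0.70),
    (0.75, 0.90, 0.80),
    (0.60, 0.40, 0.50),
    (0.85, 0.70, 0.75),
]
rewards = [composite_reward(*c) for c in candidates]
advantages = group_relative_advantages(rewards)

best = max(range(len(candidates)), key=lambda i: advantages[i])
print(best)  # 1
```

Because advantages are computed relative to the group mean, a candidate is reinforced only when it beats the other restorations sampled for the same input, so no separate value network is needed.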

Paper Structure

This paper contains 48 sections, 24 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: An overview of our IRPO post-training paradigm and its performance. (Left) Radar plot comparing average PSNR, showing that IRPO achieves SOTA In-Domain performance and vastly superior Out-of-Domain generalization. (Right) The two pillars of our paradigm: Data-oriented, which finds that training on the 30% Weak Data (underperforming subset) is optimal, and Reward-oriented, which shows the benefit of our reward components.
  • Figure 2: The overview of our proposed post-training paradigm, visually structured around its two pillars. Pillar 1 (Data-Oriented, left): A pre-trained model evaluates the full dataset to curate $\mathcal{D}_{\text{hard}}$, which serves as the post-training data. Pillar 2 (Reward-Oriented, right): A multi-component reward model (General, Expert, Task-Aware) provides signals to train the policy ($\pi_\theta$). The Image Restoration Net restores underperforming data $\mathcal{D}_{\text{hard}}$, including some TB (Transformer Block) and GDM (GRPO-Driven Model, bottom-middle).
  • Figure 3: Visual comparisons for different restoration tasks. The first row is derain, the second row is dehaze, and the third row is denoise (noise level 50). Please zoom in for better details.
  • Figure 4: Visual comparisons on real-world datasets. From top to bottom: deblurring, dehazing, denoising, deraining, and low-light enhancement. Please zoom in for better details.
  • Figure 5: Post-training on underperforming vs. random subsets for the All-in-One five tasks. The left y-axis shows average PSNR (dB) and the right y-axis shows post-training time (GPU days); the dashed line marks the AdaIR baseline (30.2 dB). Balancing accuracy and cost, the 30% underperforming subset offers the best trade-off (31.4 dB in 4.2 days, comparable to 100% data but three times faster).
  • ...and 4 more figures