
GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Lei Zhang

Abstract

Recently, reinforcement learning (RL) has been employed to improve generative image super-resolution (ISR) performance. However, current efforts focus on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require positive and negative sample pairs to be generated offline, which limits the number of samples, while Group Relative Policy Optimization (GRPO) calculates the likelihood of only the entire image, ignoring the local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach for integrating RL into the training of one-step generative ISR models. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent the performance degradation caused by noise injection, we introduce an unequal-timestep strategy that decouples the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO to calculate the group-relative advantage of each online-generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically score each sample based on its statistics over smooth and textured areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: https://github.com/Joyies/GDPO.
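
The unequal-timestep idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard DDPM forward-noising formula with a linear beta schedule, and the names make_alpha_bar, inject_noise, t_add (borrowed from the Figure 3 caption), and t_model are hypothetical. The point it shows is only that the timestep controlling how much noise is injected can differ from the timestep the one-step denoiser is conditioned on.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed, DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def inject_noise(latent, t_add, alpha_bar, rng):
    """Apply forward-diffusion noise at timestep t_add.

    Assumption: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    A smaller t_add injects weaker noise.
    """
    noise = rng.standard_normal(latent.shape)
    a = alpha_bar[t_add]
    noisy = np.sqrt(a) * latent + np.sqrt(1.0 - a) * noise
    return noisy, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical latent of a low-quality input image (channels, height, width).
latent = rng.standard_normal((4, 8, 8))

# Hypothetical, deliberately unequal timesteps: t_add sets the noise intensity,
# while t_model is what a one-step denoiser would be conditioned on.
t_add, t_model = 100, 500
noisy_latent, noise = inject_noise(latent, t_add, alpha_bar, rng)
```

Drawing a different noise sample for the same latent yields a different noisy input, which is what allows a one-step model to produce a group of diverse outputs.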

Paper Structure (17 sections, 9 equations, 9 figures, 12 tables)


Figures (9)

  • Figure 1: Noise regulates the diversity of generated samples. Different noise inputs yield both high-quality samples (e.g., noise1) and low-quality ones (e.g., noise2 and noise3). After preference learning, the model produces more visually pleasing results.
  • Figure 2: The framework of GDPO, which consists of two core stages: (a) advantage calculation and (b) policy optimization. First, we employ a pre-trained one-step Real-ISR model as the reference model to generate a group of diverse outputs by injecting different random noise samples. Subsequently, we compute the advantage $\mathcal{A}$ for each sample by evaluating its reward with our attribute-aware reward functions and converting these rewards into group-relative advantages. In the policy optimization stage, we feed these samples, along with their noise, into both the policy ISR model and the reference ISR model, and update the parameters of the policy ISR model by minimizing the proposed GDPO loss, steering it toward generating high-reward samples.
  • Figure 3: The structure of NAOSD, which uses $t_{add}$ to control the intensity of the injected noise.
  • Figure 4: The pipeline for calculating smooth and detailed regions.
  • Figure 5: Visual comparison with SD-based Real-ISR methods. Please zoom in for a better view.
  • ...and 4 more figures
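
The advantage calculation stage described in the Figure 2 caption converts per-sample rewards into group-relative advantages. The exact formula is not given in this section, so the sketch below assumes the normalization commonly used in GRPO: each reward is standardized against the mean and standard deviation of its own group. The function name group_relative_advantages and the reward values are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples.

    Assumption (GRPO-style): A_i = (r_i - mean(r)) / (std(r) + eps).
    Samples above the group mean get a positive advantage, those below a negative one.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for four outputs generated from the same low-quality
# input using four different noise samples.
rewards = [0.62, 0.48, 0.71, 0.55]
advantages = group_relative_advantages(rewards)
```

Because the advantages are relative to the group, only the ranking and spread of rewards within a group matter, not their absolute scale. If every sample in a group receives the same reward, all advantages are zero and that group contributes no preference signal.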