Reward Sharpness-Aware Fine-Tuning for Diffusion Models

Kwanyoung Kim, Byeongsu Sim

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
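The idea of taking gradients from a flattened reward can be illustrated with a one-dimensional toy, offered here only as a rough sketch and not as the RSA-FT algorithm itself. It assumes Gaussian smoothing over the reward's input, which is one common way to flatten a sharp landscape; the abstract does not specify the perturbation scheme. The reward function, the constants `A` and `K`, and the function names are all hypothetical.

```python
import math
import random

A, K = 0.05, 200.0  # hypothetical amplitude and frequency of the sharp ripple


def reward(x):
    """Toy reward: a smooth peak at x = 1 plus a small high-frequency ripple."""
    return -(x - 1.0) ** 2 + A * math.sin(K * x)


def reward_grad(x):
    """Exact gradient of the toy reward with respect to its input."""
    return -2.0 * (x - 1.0) + A * K * math.cos(K * x)


def smoothed_reward_grad(x, sigma=0.05, n_samples=20000, seed=0):
    """Monte Carlo estimate of d/dx E[reward(x + sigma * eps)], eps ~ N(0, 1).

    Averaging the gradient over perturbed inputs equals the gradient of the
    Gaussian-smoothed (flattened) reward. This is an assumed stand-in for the
    sample perturbation the abstract mentions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += reward_grad(x + sigma * rng.gauss(0.0, 1.0))
    return total / n_samples


raw = reward_grad(0.0)
flat = smoothed_reward_grad(0.0)
print(f"raw gradient: {raw:.2f}, smoothed gradient: {flat:.2f}")
```

At `x = 0` the raw gradient is 12, of which 10 comes from the ripple, even though the ripple changes the reward value by at most 0.05. The smooth trend alone has slope 2, and the averaged estimate should land close to that value. Following the raw gradient would chase the ripple, while the smoothed gradient follows the underlying trend, which is the intuition behind flattening the reward landscape.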

Paper Structure (44 sections, 19 equations, 10 figures, 11 tables, 1 algorithm)

This paper contains 44 sections, 19 equations, 10 figures, 11 tables, 1 algorithm.

Figures (10)

  • Figure 1: Qualitative comparison across different diffusion backbones and RDRL frameworks. (Top): SD1.5 results on Draft-LV and AlignProp. (Middle–Bottom): Larger backbones (SDXL and SD3) on ReFL and DRTune. Each panel compares the vanilla model, the baseline RDRL method, and the same method combined with RSA-FT (Ours). RSA-FT is compatible with diverse reward-centric diffusion reinforcement learning frameworks and backbones, effectively mitigating reward hacking and producing clear improvements in visual quality and text–prompt alignment.
  • Figure 2: Illustration of reward hacking in RDRL (Draft-LV). The original reward model raises the HPS v2.1 score but degrades other metrics and visual quality, whereas our flattened model improves all metrics with consistent visuals.
  • Figure 3: Geometry of reward fine-tuning and our proposed method. Reward models are inherently sharp and prone to adversarial perturbations. Flattening these reward landscapes alleviates their sensitivity and reduces the occurrence of adversarial gradients. (a) Prior methods directly maximize rewards along adversarial gradients from sharp reward surfaces, which often leads to reward hacking. (b) Our method instead leverages gradients from flattened reward models, mitigating hacking by flattening both the image and parameter spaces.
  • Figure 4: Negative correlation between reward sharpness and human preference. Higher sharpness in the reward model correlates with lower preference quality (Pearson $r_{\text{corr}}=-0.802$ for PickScore, $r_{\text{corr}}=-0.669$ for ImageReward), supporting the hypothesized negative relationship.
  • Figure 5: Human preference study results. The dashed line indicates the 50% mark; crossing it demonstrates a strict preference over the baseline.
  • ...and 5 more figures