Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu

TL;DR

This paper tackles prompt-image misalignment in text-to-image diffusion models by addressing the sparse reward problem in RL-based fine-tuning. It introduces $\text{B}^2\text{-DiffuRL}$, which combines backward progressive training (training starts on the final denoising steps and the training interval is progressively extended toward earlier steps) with branch-based sampling (samples that share a branch are compared within the current training interval) to obtain informative gradient signals without sacrificing diversity. The approach is compatible with multiple RL algorithms and improves alignment (measured by CLIPScore and BERTScore) while mitigating diversity loss, as demonstrated on Stable Diffusion with ablations and generalization tests. The framework can be integrated with existing RL-based diffusion fine-tuning methods.

Abstract

Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: Backward progressive training and Branch-based sampling. First, backward progressive training focuses initially on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty caused by sparse rewards. Second, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.
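
To make the backward progressive idea concrete, here is a minimal Python sketch of one possible interval schedule. It is an illustration under stated assumptions, not the authors' implementation: the linear growth rule, the function name `training_interval`, and its parameters are all hypothetical, and the abstract does not specify how quickly the interval is extended.

```python
def training_interval(stage: int, num_stages: int, num_steps: int) -> range:
    """Return the denoising step indices trained at a given stage.

    Steps are indexed 0 (first, noisiest) to num_steps - 1 (final).
    ASSUMPTION: the interval grows linearly with the stage index; the
    actual schedule used in the paper is not stated in this section.
    """
    if num_stages <= 0 or num_steps <= 0:
        raise ValueError("num_stages and num_steps must be positive")
    if not 0 <= stage < num_stages:
        raise ValueError("stage must satisfy 0 <= stage < num_stages")
    # Ceiling division so the last stage always covers every step.
    length = (num_steps * (stage + 1) + num_stages - 1) // num_stages
    # The interval always ends at the final step and extends backward.
    return range(num_steps - length, num_steps)


if __name__ == "__main__":
    for stage in range(4):
        steps = training_interval(stage, num_stages=4, num_steps=20)
        print(stage, steps.start, steps.stop - 1)
```

With 20 denoising steps and 4 stages, this schedule trains steps 15 to 19 at stage 0 and all 20 steps at stage 3, so early training only has to attribute the reward to the few steps closest to the final image.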

Paper Structure

This paper contains 32 sections, 9 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: (Prompt-image Misalignment) Text-to-image diffusion models (e.g., Stable Diffusion (SD), Rombach et al., 2022) may not generate high-quality images that accurately align with prompts. Existing reinforcement learning-based diffusion model fine-tuning methods (e.g., DDPO, Black et al., 2023) have limited effect and reduce image diversity. For each set of images above, we use the same seed for sampling.
  • Figure 2: (Sparse Reward) When people train diffusion models with reinforcement learning (RL), the reward is only available at the end of the generation process. This sparsity limits the success of RL in diffusion models. We propose $\text{B}^2\text{-DiffuRL}$, a new RL framework with two strategies, to mitigate this issue.
  • Figure 3: (Method) We propose the framework $\text{B}^2\text{-DiffuRL}$, employing two strategies to address the challenge of sparse rewards. (a) Backward progressive training strategy: We focus initially on the final timesteps of the denoising process and gradually extend the training interval to earlier timesteps, easing the learning difficulty associated with sparse rewards. (b) Branch-based sampling strategy: We perform branch-based sampling at the beginning of each training interval. Comparisons between samples within the same branch provide a clear indication of whether the policies of the current training interval positively contribute to the final images.
  • Figure 4: (Samples) Examples of images generated by different methods on three templates. For each set of images, we use the same random seed. Our method achieves better prompt-image alignment compared to vanilla Stable Diffusion and DDPO.
  • Figure 5: (Alignment) Alignment curves of our method and DDPO on three prompt templates.
  • ...and 12 more figures
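
The branch-based sampling strategy described in the Figure 3 caption can be sketched as follows. This is a toy illustration, not the authors' code: `branch_sample`, `centered_rewards`, and `toy_step` are hypothetical names, the scalar "denoising" step stands in for a real diffusion model, and mean-centering rewards within a branch is only one assumed way to compare samples (the exact comparison rule is not given in this section).

```python
import random
from typing import Callable, List

StepFn = Callable[[float, int, random.Random], float]


def branch_sample(x_init: float, branch_step: int, num_steps: int,
                  num_branches: int, step_fn: StepFn,
                  rng: random.Random) -> List[float]:
    """Run a shared prefix once, then fork into several branches.

    Steps before `branch_step` are shared by every branch; steps from
    `branch_step` onward (the training interval) are sampled
    independently, so differences in the final samples come only from
    the training interval.
    """
    x = x_init
    for t in range(branch_step):
        x = step_fn(x, t, rng)
    finals = []
    for _ in range(num_branches):
        x_b = x  # every branch starts from the same intermediate state
        for t in range(branch_step, num_steps):
            x_b = step_fn(x_b, t, rng)
        finals.append(x_b)
    return finals


def centered_rewards(rewards: List[float]) -> List[float]:
    """ASSUMED comparison rule: subtract the within-branch mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def toy_step(x: float, t: int, rng: random.Random) -> float:
    """Hypothetical stand-in for one stochastic denoising step."""
    return 0.9 * x + 0.1 * rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    finals = branch_sample(1.0, branch_step=15, num_steps=20,
                           num_branches=4, step_fn=toy_step, rng=rng)
    rewards = [-abs(x) for x in finals]  # toy reward: closer to 0 is better
    print(centered_rewards(rewards))
```

Because all branches share the same prefix, a positive centered reward can be attributed to the choices made inside the training interval, not to a lucky starting state.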