A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, Satya Narayan Shukla

TL;DR

This work addresses the cost inefficiency of reinforcement learning for text-to-image diffusion fine-tuning by analyzing the trade-off between REINFORCE and PPO and introducing LOOP, a leave-one-out PPO method. LOOP combines the variance reduction techniques of REINFORCE (multiple trajectories per prompt and a leave-one-out baseline) with PPO's clipping and importance sampling, which keep training stable. On the T2I-CompBench benchmark, LOOP achieves substantial improvements over PPO and REINFORCE on attribute binding, aesthetics, and image-text alignment, particularly as the number of trajectories per prompt (K) increases. The approach offers a practical, high-performance RL fine-tuning strategy for diffusion models, at the cost of longer training time because multiple trajectories must be sampled.

Abstract

Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is the most popular method for policy optimization. While effective in terms of performance, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some of these computational complexities, such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high variance and sample inefficiency. While the variance of REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
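
The combination described above can be illustrated with a minimal sketch. It assumes that each of the K samples drawn for one prompt is summarized by a scalar reward and a single log-probability under the new and old policies; in an actual diffusion model the importance ratio would be computed per denoising step, and the paper's exact objective may differ. The function names and the default `clip_eps=0.2` are hypothetical choices made for illustration, not taken from the paper.

```python
import math


def leave_one_out_advantages(rewards):
    """Subtract from each reward the mean reward of the other K-1 samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("a leave-one-out baseline needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped importance-weighted term for a single sample."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


def loop_objective(rewards, logp_new, logp_old, clip_eps=0.2):
    """Average clipped surrogate over the K samples of one prompt (to be maximized)."""
    advantages = leave_one_out_advantages(rewards)
    terms = [
        clipped_surrogate(n, o, a, clip_eps)
        for n, o, a in zip(logp_new, logp_old, advantages)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Hypothetical rewards for K = 4 images generated from one prompt.
    rewards = [1.0, 2.0, 3.0, 4.0]
    print(leave_one_out_advantages(rewards))
```

Because each sample's baseline is computed only from the other samples, the baseline does not depend on the sample it corrects, which is what keeps the gradient estimate unbiased while reducing its variance. The clipping step limits how far a single update can move the policy away from the one that generated the samples.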

Paper Structure

This paper contains 15 sections, 2 theorems, 19 equations, 5 figures, 3 tables.

Key Result

Theorem 3.1

[achiam2017constrained] Consider a current policy $\pi_{k}$. Let $C^{\pi,\pi_{k}} = \max_{s\in S} |\mathbb{E}_{a\sim\pi(\cdot\mid s)} \left[ A^{\pi_{k}}(s,a) \right]|$, and $\mathrm{TV}(\pi(\cdot\mid s),\pi_{k}(\cdot\mid s))$ represent the total variation distance between the policies $\pi(\cdot\mid

Figures (5)

  • Figure 1: LOOP improves attribute binding. Qualitative examples of images generated by Stable Diffusion (SD) 2.0 (first row), DDPO [black2023training] (second row), and LOOP $k=4$ (third row). In the first example, SD and DDPO both fail to bind the color black to the ball, whereas LOOP binds it correctly. In the second example, SD and DDPO fail to generate a rusted-bronze lamppost, whereas LOOP succeeds. In the third example, SD and DDPO fail to bind the hexagon shape to the watermelon, whereas LOOP does. In the fourth example, SD and DDPO fail to generate the black horse with flowing cyan patterns, whereas LOOP generates the horse with the correct color attributes. Finally, in the last example, SD and DDPO fail to bind the cobalt blue color to the rock, whereas LOOP binds it successfully.
  • Figure 2: Evaluating the REINFORCE vs. PPO trade-off by comparing REINFORCE (Eq. \ref{eq:cb_pg}), REINFORCE with a baseline correction term (Eq. \ref{eq:cb_pg_b}), and PPO (Eq. \ref{eq:ppo_obj}). We evaluate on the T2I-CompBench benchmark over three image attributes: Color, Shape, and Texture. We also compare on the aesthetic task. The y-axis shows training reward, and the x-axis shows training epoch. Results are averaged over three runs; shaded areas indicate 80% prediction intervals.
  • Figure 3: Comparing DDPO (referenced as PPO) with the proposed LOOP on the T2I-CompBench benchmark with respect to image attributes: Color, Shape, Texture, and Spatial relationship. We also report results on aesthetic preference and image–text alignment tasks [black2023training]. The y-axis shows training reward, and the x-axis shows training epoch. Results are averaged over three independent runs; shaded areas denote 80% prediction intervals.
  • Figure 4: LOOP improves aesthetic quality. Qualitative examples of images generated by Stable Diffusion 2.0 (first row), PPO (second row), and LOOP $k=4$ (third row). LOOP consistently generates more aesthetically pleasing images than PPO and SD.
  • Figure 5: Additional qualitative examples of images generated by Stable Diffusion 2.0 (first row), PPO (second row), and LOOP $k=4$ (third row). LOOP consistently generates more aesthetically pleasing images than PPO and SD (first, third, and fifth prompts). LOOP also binds the color attribute (the teal branch in the second example and the pink cornfield in the fourth example) where SD and PPO fail.

Theorems & Definitions (4)

  • Definition 1
  • Theorem 3.1
  • Proposition 4.1
  • Proof