A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

Shashank Gupta; Chaitanya Ahuja; Tsung-Yu Lin; Sreya Dutta Roy; Harrie Oosterhuis; Maarten de Rijke; Satya Narayan Shukla

TL;DR

This work tackles the cost-inefficiency of reinforcement learning for text-to-image diffusion fine-tuning by analyzing REINFORCE versus PPO and introducing LOOP, a Leave-One-Out PPO method. LOOP combines variance reduction from REINFORCE (multiple trajectories and a leave-one-out baseline) with PPO’s clipping and importance sampling to maintain stability. Empirical results on the T2I-CompBench benchmark show LOOP achieves substantial improvements over PPO and REINFORCE across attribute binding, aesthetics, and image-text alignment, particularly with more trajectories (K). The approach offers a practical, high-performance RL fine-tuning strategy for diffusion models, albeit with increased training time due to multiple trajectories.

Abstract

Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is the most popular method for policy optimization. While effective in terms of performance, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some of these computational complexities, such as high memory overhead and sensitive hyper-parameter tuning, but performs suboptimally due to high variance and sample inefficiency. While the variance of REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
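
To make the combination described above concrete, here is a minimal sketch of the two ingredients: a leave-one-out baseline computed over K samples drawn for the same prompt, and a PPO-style clipped, importance-weighted surrogate. It is an illustration under stated assumptions, not the authors' implementation. The function names, the use of one scalar log-probability per sample, and the clipping value of 0.2 are all hypothetical choices made for this example; a diffusion policy would typically accumulate such terms over denoising steps.

```python
import math
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Each sample's reward minus the mean reward of the other K-1 samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out baseline needs at least 2 samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(
    new_logps: List[float],
    old_logps: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,  # illustrative value, not taken from the paper
) -> float:
    """PPO-style clipped objective (to be maximized), averaged over the samples."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Importance weight between the current policy and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps] to limit the update size.
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate values.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Hypothetical rewards for K = 3 images generated from one prompt.
    advantages = leave_one_out_advantages([1.0, 2.0, 3.0])
    print(advantages)
    # With identical old and new log-probabilities, every ratio equals 1.
    print(clipped_surrogate([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], advantages))
```

The baseline for each sample excludes that sample's own reward, so it does not depend on the action being scored, and the advantages for a prompt sum to zero. Clipping leaves the objective unchanged when the ratio is close to 1 and caps the gain from moving further away from the sampling policy.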
