HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

Ayano Hiranaka; Shang-Fu Chen; Chieh-Hsin Lai; Dongjun Kim; Naoki Murata; Takashi Shibuya; Wei-Hsiang Liao; Shao-Hua Sun; Yuki Mitsufuji

TL;DR

HERO tackles the challenge of aligning diffusion-based text-to-image generation with human intent using online human feedback. It introduces Feedback-Aligned Representation Learning to convert discrete feedback into continuous rewards and Feedback-Guided Image Generation to seed sampling from refined noises, enabling efficient DDPO-based fine-tuning with LoRA. The approach yields substantial improvements in feedback efficiency (roughly 4x) and demonstrates transferability of learned preferences and safety concepts across prompts, including tasks requiring spatial reasoning and personalization. Overall, HERO provides a practical, data-efficient framework for online RLHF in diffusion models with tangible gains in controllable generation and safety containment.

Abstract

Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD's refined initialization samples, enabling faster convergence towards the evaluator's intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback. The code and project page are available at https://hero-dm.github.io/.

Paper Structure (42 sections, 2 theorems, 33 equations, 15 figures, 10 tables, 1 algorithm)

Introduction
Related Works
Preliminaries
Stable Diffusion (SD)
Denoising Diffusion Policy Optimization (DDPO)
Problem Setup and the Proposed Method
Online Human Feedback
Feedback-Aligned Representation Learning
Learning Representations
Similarity-based Rewards Computation
Diffusion Model Finetuning
Feedback-Guided Image Generation
Experimental Results
Hand Deformation Correction
Demonstration on the Variety of Tasks
...and 27 more sections

Key Result

Proposition A.1

Let $\pi$ be a Gaussian mixture with each component as $\mathcal{N}(\bm{\mu}_i, \varepsilon_0^2 \mathbf{I}_D)$, where each mean $\bm{\mu}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_D)$, and $\varepsilon_0 > 0$ is a small constant. Let ${\mathbf{y}} \sim \pi$ be a random vector drawn from $\pi$. Then, Namely, ${\mathbf{y}}$ is concentrated around the shell of radius $\sqrt{D}$ and thickness $\sqrt{D
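
The concentration claim can be checked numerically. The following is a minimal sketch, not the paper's code: it draws samples from a Gaussian mixture of the form described in Proposition A.1 and compares their norms with $\sqrt{D}$. The uniform mixture weights, the number of components, the dimension, and the function name `sample_mixture_norms` are all assumptions made for illustration.

```python
import math
import random


def sample_mixture_norms(dim, num_components, eps0, num_samples, seed=0):
    """Draw samples from a Gaussian mixture and return their Euclidean norms.

    Component i is N(mu_i, eps0^2 * I) with mu_i ~ N(0, I), as in Proposition A.1.
    Uniform mixture weights are an assumption made for this illustration.
    """
    rng = random.Random(seed)
    # Component means, each drawn from a standard Gaussian in `dim` dimensions.
    means = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_components)]
    norms = []
    for _ in range(num_samples):
        mu = rng.choice(means)  # pick a component uniformly at random
        y = [m + eps0 * rng.gauss(0.0, 1.0) for m in mu]
        norms.append(math.sqrt(sum(v * v for v in y)))
    return norms


if __name__ == "__main__":
    dim = 1024  # hypothetical dimension, chosen only to keep the demo fast
    norms = sample_mixture_norms(dim=dim, num_components=8, eps0=0.1, num_samples=200)
    mean_norm = sum(norms) / len(norms)
    print(f"sqrt(D) = {math.sqrt(dim):.2f}, mean ||y|| = {mean_norm:.2f}")
    print(f"min ||y|| = {min(norms):.2f}, max ||y|| = {max(norms):.2f}")
```

With a small `eps0`, the printed norms should all lie close to $\sqrt{D}$, since $\mathbb{E}\|{\mathbf{y}}\|^2 = D(1 + \varepsilon_0^2)$ for this mixture.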

Figures (15)

Figure 1: ⓪ Online Human Feedback on Generated Images: Each epoch, SD generates a batch of images, evaluated by a human as "good" or "bad", with the "best" among the "good" selected. The corresponding SD noises and latents are saved. ① Feedback-Aligned Representation Learning: Human-annotated images train an embedding map via contrastive learning, converting feedback into continuous representations. These are rated by cosine similarity to one of the "best" images and used to fine-tune SD via DDPO (Black et al., 2024). ② Feedback-Guided Image Generation: New images are generated from a Gaussian mixture centered around the recorded noises of "good" images. This process is repeated until the feedback budget is exhausted.
Figure 2: Result preview. Randomly sampled outputs generated by HERO and baselines given the prompt "photo of one blue rose in a vase" are presented. Successful and unsuccessful samples are marked; the unsuccessful samples fail to accurately capture the specified count (more than one rose), color (non-blue roses), or context (missing vase). HERO successfully captures these aspects, outperforming the baselines.
Figure 3: Hand anomaly correction success rates. The performance of each method except D3PO is averaged over 8 seeds, where each seed is evaluated on 128 images per epoch. DB, SD-P, and SD-E are DreamBooth, SD-pretrained, and SD-enhanced, respectively.
Figure 4: Qualitative results. Randomly generated samples for the four tasks are shown, with successful samples and failures marked. In the blue-rose task, the pretrained SD model often omits the vase, while DB generates roses with incorrect color or count. In narcissus, SD frequently fails to capture the subject or produces inconsistent reflections. For black-cat, baseline models exhibit more issues (e.g., the cat's body penetrating the box). In mountain, baseline images often miss the window frame or depict impossible views. Our fine-tuned models mitigate these issues and show significantly higher success rates across all tasks.
Figure 5: Effect of the best image ratio $\beta$, evaluated on the black-cat task. Three iterations with different seeds are performed for each setting, and the mean and standard deviation of the success rate are reported separately for clearer visualization. "random" refers to the case where random noise latents are used for sampling (the good and best noise latents are not used).
...and 10 more figures
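
The two mechanisms described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are taken as given (in the paper they come from an embedding map trained by contrastive learning), the mixture weights are assumed uniform, and the names `similarity_rewards`, `sample_guided_noise`, and `eps0` are hypothetical. The DDPO fine-tuning step that consumes the rewards is not shown.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def similarity_rewards(embeddings, best_embedding):
    """Continuous reward per image: similarity of its embedding to a 'best' image."""
    return [cosine_similarity(e, best_embedding) for e in embeddings]


def sample_guided_noise(good_noises, eps0, num_samples, rng):
    """Sample initial noises from a Gaussian mixture centered on 'good' noises.

    Each component is N(center, eps0^2 * I); uniform weights are an assumption.
    """
    samples = []
    for _ in range(num_samples):
        center = rng.choice(good_noises)  # pick one recorded 'good' noise
        samples.append([c + eps0 * rng.gauss(0.0, 1.0) for c in center])
    return samples


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for learned representations.
    best = [1.0, 0.0]
    embeddings = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
    print(similarity_rewards(embeddings, best))  # → [1.0, 0.0, -1.0]

    # Toy 4-D 'good' noises standing in for recorded SD initial noises.
    rng = random.Random(0)
    good_noises = [[0.0] * 4, [10.0] * 4]
    for noise in sample_guided_noise(good_noises, eps0=0.01, num_samples=3, rng=rng):
        print([round(x, 3) for x in noise])
```

Rewards computed this way lie in $[-1, 1]$, giving a graded signal from binary "good"/"bad" labels, while the sampled noises stay close to initializations that previously led to "good" images.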

Theorems & Definitions (4)

Proposition A.1: Concentration of $\pi_{\mathrm{HERO}}$
proof
Proposition A.2: Information Link Between ${\mathbf{z}}_T$ and Generated ${\mathbf{z}}_0$
proof
