ROCM: RLHF on consistency models

Shivanshu Shekhar, Tong Zhang

TL;DR

This work addresses the slow sampling and training challenges of diffusion-based RLHF by leveraging consistency models with a direct reward-optimization framework. It introduces a regularized RLHF objective that uses distributional regularization across intermediate steps via $f$-divergences and exploits the reparameterization trick to backpropagate through the entire generation trajectory, replacing policy-gradient methods. Empirically, Regularized ROCM achieves competitive or superior results across multiple reward models and metrics, with faster training and improved human preferences, while analysis shows that regularization mitigates reward hacking and enhances generalization. The approach hinges on differentiable rewards and Gaussian-conditioned divergences, offering a practical, first-order alternative to PPO for RLHF on consistency models and suggesting avenues for further exploration of divergence choices and reward-model interactions.

Abstract

Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further exacerbated when incorporating Reinforcement Learning from Human Feedback (RLHF) due to sparse rewards and long time horizons. Consistency models address these issues by enabling single-step or efficient multi-step generation, significantly reducing computational costs. In this work, we propose a direct reward optimization framework for applying RLHF to consistency models, incorporating distributional regularization to enhance training stability and prevent reward hacking. We investigate various $f$-divergences as regularization strategies, striking a balance between reward maximization and model consistency. Unlike policy gradient methods, our approach leverages first-order gradients, making it more efficient and less sensitive to hyperparameter tuning. Empirical results show that our method achieves competitive or superior performance compared to policy-gradient-based RLHF methods across various automatic metrics and human evaluation. Additionally, our analysis demonstrates the impact of different regularization techniques in improving model generalization and preventing overfitting.
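To make the idea of first-order, reparameterized reward optimization with a divergence penalty concrete, here is a minimal toy sketch. It is not the paper's algorithm or code. It assumes a hypothetical 1D "trajectory" of a few Gaussian steps, a hypothetical differentiable reward $r(x) = -(x - \text{target})^2$, and KL as the chosen $f$-divergence, for which the per-step penalty between two Gaussians with equal variance has the closed form $(\theta_k - \theta^{\text{ref}}_k)^2 / (2\sigma^2)$. The function `train` and all of its parameters are illustrative names, and the gradient is written out by hand rather than taken through a real consistency model.

```python
import random


def train(beta, num_steps=3, sigma=0.5, target=2.0, lr=0.05,
          iters=500, batch=256, seed=0):
    """Toy KL-regularized reward ascent using reparameterized Gaussian steps.

    Assumed (hypothetical) setup: x_0 = 0 and x_{k+1} = x_k + theta_k + sigma * eps_k,
    with reward r(x_K) = -(x_K - target)^2 and a frozen reference theta_ref = 0.
    """
    rng = random.Random(seed)
    theta = [0.0] * num_steps       # trainable per-step mean shifts
    theta_ref = [0.0] * num_steps   # frozen reference policy
    for _ in range(iters):
        grad_reward = 0.0
        for _ in range(batch):
            # Reparameterized rollout: noise is sampled independently of theta,
            # so the final sample x is a differentiable function of theta.
            x = 0.0
            for k in range(num_steps):
                x += theta[k] + sigma * rng.gauss(0.0, 1.0)
            # dr/dx = -2 (x - target); dx_K/dtheta_k = 1 for every step k,
            # so all steps share the same pathwise reward gradient.
            grad_reward += -2.0 * (x - target)
        grad_reward /= batch
        for k in range(num_steps):
            # Gradient of the per-step KL between equal-variance Gaussians.
            grad_kl = (theta[k] - theta_ref[k]) / sigma ** 2
            # First-order ascent on E[r] - beta * sum_k KL_k.
            theta[k] += lr * (grad_reward - beta * grad_kl)
    return theta


if __name__ == "__main__":
    print("regularized:", train(beta=0.5))
    print("unregularized:", train(beta=0.0))
```

In this toy setting the stationary point can be worked out by hand: with `beta=0.5` each shift settles near $2 \cdot \text{target} / (2K + \beta/\sigma^2) = 0.5$, so the trajectory ends around 1.5 instead of the reward-maximizing 2.0. This is the sense in which the penalty trades reward against staying close to the reference. Swapping in a different $f$-divergence would only change the `grad_kl` line.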

Paper Structure

This paper contains 10 sections, 13 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: Examples of images generated by the model aligned using the KL divergence regularization constraint and the HPS reward model.
  • Figure 2: Sample images generated by our baselines and by ROCM trained with HPSv2 as the reward model.
  • Figure 3: User study comparing our best model for each reward model with RLCM fine-tuned on the same reward model. Following SPO, we use 300 randomly sampled prompts in total, drawn from PartiPrompts and HPS in a 1:2 ratio.
  • Figure 4: As $\beta$ decreases, we observe an initial improvement in model performance. However, with further reduction in $\beta$, the actual preference reaches a peak and then begins to decline, indicating reward hacking.
  • Figure 5: Training efficiency of each method; panels A, B, C, and D show models trained with CLIPScore, Aesthetic Score, PickScore, and HPSv2, respectively. Our method is consistently more training-efficient than the other methods across reward models. Notably, improvements are relatively minor for PickScore and CLIPScore. The limited gain in CLIPScore is expected, as it primarily aids in prompt alignment, while PickScore's lower sensitivity to image quality results in a smaller increase. In contrast, HPSv2 and Aesthetic Score exhibit significant improvements within just 15 GPU hours. The mean and error bars are computed from a running average with a window size of 20.
  • ...and 1 more figure