DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan

TL;DR

DRAGON addresses the misalignment between diffusion-based media generation and downstream objectives by introducing distributional rewards and an online, on-policy optimization framework. It constructs positive and negative demonstration sets from on-policy generations using exemplar distributions and optimizes toward target distributions via KL-based objectives, employing surrogate losses like Diffusion-DPO and Diffusion-KTO. The approach supports instance-wise, instance-to-distribution, and distribution-to-distribution rewards, including cross-modal references (e.g., text for audio), and evaluates on a text-to-music diffusion model across 20 reward signals, achieving substantial improvements in both objective metrics (FAD, CLAP, Vendi) and human-perceived quality. Practically, DRAGON reduces reliance on human preference data by enabling reward design from exemplar sets and demonstrates strong, data-efficient gains across multiple reward modalities with potential applicability to images and video as well.

Abstract

We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modal encoders such as CLAP are used, the reference may be of a different modality (text versus audio). Then, DRAGON gathers online and on-policy generations, scores them with the reward function to construct a positive demonstration set and a negative set, and leverages the contrast between the two finite sets to approximate distributional reward optimization. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. DRAGON is a new approach to designing and optimizing reward functions for improving human-perceived quality. Demos at https://ml-dragon.github.io/web
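
The scoring-and-splitting step described in the abstract can be illustrated with a minimal sketch for the instance-wise case. This is not the authors' implementation: the function name `split_demonstrations`, the `reward_fn` callable, and the median threshold are all assumptions chosen for illustration, and the abstract does not specify how the two sets are actually selected.

```python
from statistics import median
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_demonstrations(
    generations: Sequence[T],
    reward_fn: Callable[[T], float],
) -> Tuple[List[T], List[T]]:
    """Split on-policy generations into positive and negative sets.

    Hypothetical rule (an assumption, not taken from the paper):
    a generation is positive if its reward is strictly above the
    batch median, and negative otherwise.
    """
    # Score every generation with the instance-wise reward function.
    scores = [reward_fn(g) for g in generations]
    threshold = median(scores)

    positives = [g for g, s in zip(generations, scores) if s > threshold]
    negatives = [g for g, s in zip(generations, scores) if s <= threshold]
    return positives, negatives


# Toy usage: the "generations" are plain floats and the reward is the
# identity, standing in for a real scorer such as an aesthetics model.
pos, neg = split_demonstrations([0.1, 0.9, 0.4, 0.7], lambda x: x)
```

In a full pipeline, the two sets would then feed a contrastive surrogate loss such as the Diffusion-DPO or Diffusion-KTO objectives named in the TL;DR. Distribution-level rewards such as full-dataset FAD score a whole set rather than single items, so they would need a different selection rule from this per-item threshold.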

Paper Structure

This paper contains 40 sections, 13 equations, 13 figures, 11 tables, 1 algorithm.

Figures (13)

  • Figure 1: DPO versus KTO loss function; paired versus unpaired demonstrations.
  • Figure 2: DRAGON with different demonstration diffusion steps and inference steps.
  • Figure 3: Vendi score of models optimized for each reward type. Point height represents Vendi score and point size represents aesthetics win rate. Each per-song/dataset FAD point is trained with a different reference statistic. Bar height is the average of the point heights.
  • Figure 4: Ablation study on aesthetics model settings. Higher correlation with human ratings means better aesthetics model performance.
  • Figure 5: Histograms of human-rated and predicted aesthetics score over the DMA dataset after global label normalization.
  • ...and 8 more figures