Diversity-Rewarded CFG Distillation

Geoffrey Cideron; Andrea Agostinelli; Johan Ferret; Sertan Girgin; Romuald Elie; Olivier Bachem; Sarah Perrin; Alexandre Ramé

TL;DR

The paper addresses two limitations of Classifier-Free Guidance (CFG): higher inference cost and reduced diversity across generations. Its method, diversity-rewarded CFG distillation, distills the quality gains of CFG into the model weights while promoting diversity through reinforcement learning with an embedding-based diversity reward. The joint objective is QD(θ) = Q(θ) + β D(θ), where β is the diversity coefficient. Model merging then gives deployment-time control over the quality-diversity trade-off: the weights of a quality-focused model θ_q and a diversity-focused model θ_d are interpolated as θ_LERP = (1−λ) θ_q + λ θ_d. On text-to-music generation with MusicLM, the approach improves on CFG in quality-diversity Pareto optimality, and human evaluations indicate that weight-merged models can achieve higher diversity without sacrificing quality. For creative AI tasks, the practical benefit is that the CFG inference cost disappears while exploration stays tunable; the paper also suggests extensions to other modalities and future refinements of the diversity metric.
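
To make the merging formula concrete, here is a minimal sketch of θ_LERP = (1−λ) θ_q + λ θ_d applied parameter by parameter. It assumes each checkpoint is a plain dictionary of NumPy arrays with identical keys and shapes; the function name `lerp_weights` and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def lerp_weights(theta_q, theta_d, lam):
    """Return (1 - lam) * theta_q + lam * theta_d for every parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if theta_q.keys() != theta_d.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: (1.0 - lam) * theta_q[name] + lam * theta_d[name]
        for name in theta_q
    }


# Toy stand-ins for a quality-focused and a diversity-focused checkpoint
# (hypothetical values, for illustration only).
theta_q = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
theta_d = {"w": np.array([3.0, 6.0]), "b": np.array([1.0])}

theta_mid = lerp_weights(theta_q, theta_d, 0.25)
```

Setting λ = 0 recovers the quality-focused weights and λ = 1 the diversity-focused ones, so a single scalar chosen at deployment time moves along the front between the two models.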

Abstract

Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the first focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the MusicLM (Agostinelli et al., 2023) text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations at https://google-research.github.io/seanet/musiclm/diverse_music/.
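
The two ingredients named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `cfg_logits` uses the commonly used CFG form that extrapolates from unconditional toward conditional logits with guidance factor γ, and `diversity_reward` assumes the embedding-based reward is one minus the cosine similarity between the embeddings of two generations for the same prompt. Both function names are hypothetical.

```python
import numpy as np


def cfg_logits(cond_logits, uncond_logits, gamma):
    """Combine conditional and unconditional logits with guidance factor gamma.

    Assumed standard CFG form; gamma = 1 returns the conditional logits.
    Two forward passes (conditional and unconditional) are needed per step,
    which is the source of the extra inference cost.
    """
    return uncond_logits + gamma * (cond_logits - uncond_logits)


def diversity_reward(emb_a, emb_b):
    """Assumed reward: 1 - cosine similarity of two generation embeddings."""
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return 1.0 - cos


cond = np.array([2.0, 0.0])
uncond = np.array([1.0, 1.0])
guided = cfg_logits(cond, uncond, gamma=3.0)

# Orthogonal embeddings are maximally different under this reward,
# identical embeddings give zero reward.
reward_far = diversity_reward(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
reward_same = diversity_reward(np.array([1.0, 2.0]), np.array([1.0, 2.0]))
```

In this reading, distillation trains the model's own single-pass predictions to match the guided distribution, while the reward pushes two samples for the same prompt apart in embedding space.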

Paper Structure (21 sections, 11 equations, 7 figures)

Introduction
Diversity-rewarded CFG distillation
CFG distillation for quality
RL for diversity
Diversity-rewarded CFG distillation
Model merging for Pareto-optimal quality-diversity
Experiments
Text-to-music generation: task, setup and metrics
CFG distillation for quality
RL for diversity
Model merging for Pareto-optimal quality-diversity
Human evaluation
Human evaluation for quality-diversity trade-off
Related work
Discussions and limitations
...and 6 more sections

Figures (7)

Figure 1: Left. Illustration of the two objectives: CFG distillation (above) and the diversity reward (below), multiplied by the diversity coefficient $\beta$ in the joint finetuning objective. Right. Quality-diversity trade-off for different strategies. The first four lines represent the training trajectories of our approach, distilling CFG (with $\gamma=3$) with varying diversity coefficient $\beta$ in $\{0, 5, 10, 15\}$; every $500$ training steps, we evaluate the quality and diversity of the generations. Larger values of $\beta$ lead to more diverse models yet slightly less quality. For linear interpolation (LERP), each cross corresponds to a $0\leq\lambda\leq1$ when interpolating between the weights $\theta_q$ of a quality-focused model ($\beta=0$) and those $\theta_{d}$ of a diversity-focused model ($\beta=15$); the evaluated generations are obtained from the weights $(1-\lambda)\cdot \theta_q + \lambda \cdot \theta_{d}$. For the CFG baseline, each dot corresponds to a different value for the guidance factor $1\leq\gamma\leq7$. This plot shows that our method improves the quality-diversity trade-off; notably, LERP uncovers a strong and steerable front of solutions by just interpolating between the weights of two models, at deployment time.
Figure 2: Left. Evolution of the KL divergence between the CFG-distilled student and the CFG-augmented teacher along training. GKD distillation alone ($\beta=0$) decreases the KL between the two policies. Middle. Evolution of the quality along training, showing improved quality for all selected values of $\beta$. Right. Evolution of the diversity across generations along training, showing that CFG distillation alone reduces diversity, but that using a diversity reward ($\beta\neq0$) can actually increase it. The "CFG" line shows the quality/diversity performance of the CFG-augmented base model serving as a teacher. The "upper-bound" line indicates the mean diversity of two generations (from the base model) for two different prompts.
Figure 3: Quality-diversity trade-off for multiple strategies. The first four lines linearly interpolate (LERP) between the quality-focused model ($\beta=0$) and more diverse models (those trained with $\beta>0$, or the base model), sliding $\lambda$ between $0$ and $1$ with a step of $0.05$. We also report the performance of the uniform ($\lambda=\frac{1}{4}$) averaging of the four models finetuned with different $\beta$, denoted as "LERP($0, 5, 10, 15$) uniform". We include inference-time baseline strategies, namely CFG (when varying $\gamma$) and temperature sampling (when varying the temperature $T$), applied either to the base model or to the CFG-distilled model.
Figure 4: Left. Linear interpolation between the weights of a model focused on quality ($\beta=0$) and a model focused on diversity ($\beta=15$), sliding the coefficient $\lambda$ between $0$ and $1$. The dashed diagonal represents the expected values if abilities were traded off linearly between those two models. While the diversity stays close to the diagonal, the quality is above it, which explains the benefits of model merging. Right. For comparison, we also include the results for CFG when sliding $\gamma$ between $1$ and $7$, which performs worse than the merged models.
Figure 5: Left. Side-by-side human evaluation for quality. Right. Side-by-side human evaluation for diversity. The score corresponds to the win rate of model A over model B, computed as $\frac{W+T/2}{W+L+T}$ with $W$ the number of wins of A over B, $T$ the number of ties, and $L$ the number of losses of A against B. This confirms that our approach improves the quality-diversity trade-off. For instance, the merged model LERP(0, 15) generates music with higher diversity than the CFG-augmented base model ($\gamma=3$) in 57% of the comparisons, while being rated as higher quality in about half of the comparisons (51%).
...and 2 more figures
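
The win rate used in the Figure 5 caption, $\frac{W+T/2}{W+L+T}$, counts each tie as half a win. A minimal sketch of that computation follows; the function name `win_rate` and the counts in the example are hypothetical and are not results from the paper.

```python
def win_rate(wins, losses, ties):
    """Win rate of model A over model B, with each tie counted as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (wins + ties / 2) / total


# Hypothetical counts: 10 wins, 6 losses, 4 ties out of 20 comparisons.
example_rate = win_rate(wins=10, losses=6, ties=4)  # (10 + 2) / 20 = 0.6
```

Under this definition a score of 0.5 means the two models are rated on par, which is how the 51% quality figure in the caption should be read.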
