Diversity-Rewarded CFG Distillation
Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin, Alexandre Ramé
TL;DR
Classifier-Free Guidance (CFG) improves generation quality but doubles inference cost and reduces diversity. The paper addresses both limitations with diversity-rewarded CFG distillation, which distills the quality of CFG into the model weights while promoting diversity through reinforcement learning with an embedding-based diversity reward. It formulates a joint objective QD(θ) = Q(θ) + β D(θ), where Q is the quality (distillation) term, D is the diversity term, and β sets the weight of diversity. It also enables deployment-time control of the quality-diversity trade-off via model merging: the weights of a quality-focused model θ_q and a diversity-focused model θ_d are interpolated as θ_LERP = (1−λ) θ_q + λ θ_d. The approach is validated on text-to-music generation with MusicLM, where it improves on CFG in terms of quality-diversity Pareto optimality, and human evaluations show that weight-merged models can achieve higher diversity without sacrificing quality. The practical benefit for creative AI tasks is that the inference overhead of CFG is removed while the amount of exploration remains tunable. The work suggests extensions to other modalities and refinements of the diversity metric as future directions.
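The interpolation θ_LERP = (1−λ) θ_q + λ θ_d can be illustrated with a minimal sketch. It assumes each model is represented as a dictionary mapping parameter names to flat lists of floats; the function name `lerp_weights` and this representation are hypothetical and chosen for illustration only, not taken from the paper.

```python
from typing import Dict, List

# Assumed representation: parameter name -> flat list of float weights.
Weights = Dict[str, List[float]]


def lerp_weights(theta_q: Weights, theta_d: Weights, lam: float) -> Weights:
    """Return (1 - lam) * theta_q + lam * theta_d, parameter by parameter.

    theta_q: weights of the quality-focused model.
    theta_d: weights of the diversity-focused model.
    lam:     interpolation coefficient (lam = 0 gives theta_q, lam = 1 gives theta_d).
    """
    if theta_q.keys() != theta_d.keys():
        raise ValueError("both models must share the same parameter names")

    merged: Weights = {}
    for name, w_q in theta_q.items():
        w_d = theta_d[name]
        if len(w_q) != len(w_d):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two weight vectors.
        merged[name] = [(1.0 - lam) * a + lam * b for a, b in zip(w_q, w_d)]
    return merged


# Toy example with a single two-element parameter.
theta_q = {"w": [0.0, 2.0]}
theta_d = {"w": [4.0, 6.0]}
merged = lerp_weights(theta_q, theta_d, lam=0.25)
```

Because merging happens in weight space, only one model is run at inference time, and λ can be changed at deployment without any retraining.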
Abstract
Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the first focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the MusicLM (Agostinelli et al., 2023) text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations at https://google-research.github.io/seanet/musiclm/diverse_music/.
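As a rough illustration of an embedding-based diversity reward and of the joint objective QD = Q + β D, the sketch below scores a pair of generations for the same prompt by one minus the cosine similarity of their embeddings. The exact reward used in the paper is not specified in this section, so this form, the function names, and the toy embeddings are all assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def diversity_reward(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Assumed reward: higher when two generations are less similar."""
    return 1.0 - cosine_similarity(emb_a, emb_b)


def joint_objective(quality: float, diversity: float, beta: float) -> float:
    """Combine the two terms as QD = Q + beta * D."""
    return quality + beta * diversity


# Toy embeddings (hypothetical): orthogonal vectors are maximally rewarded,
# parallel vectors receive a reward close to zero.
reward_orthogonal = diversity_reward([1.0, 0.0], [0.0, 1.0])
reward_parallel = diversity_reward([1.0, 2.0], [2.0, 4.0])
```

In an actual training loop the embeddings would come from a learned audio or text encoder, and the reward would be optimised with an RL algorithm alongside the distillation loss.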
