Understanding the Quality-Diversity Trade-off in Diffusion Language Models

Zak Buzzard

TL;DR

This work addresses how to control the quality-diversity trade-off in diffusion language models that operate in the embedding space of text. It introduces two inference-time techniques, classifier-free guidance and stochastic clamping, along with a combined approach, to tune fidelity and diversity without retraining. Using a transformer-based encoder-decoder and an anchor loss with importance sampling, the work reports competitive QQP paraphrasing results after only about three hours of training on a single GPU, and provides an open-source implementation. The findings indicate that diffusion-based text generation can efficiently cover a broad range of quality and diversity levels, supports length-controllable generation, and compares favorably with state-of-the-art methods. The work also identifies limitations in evaluation and length control as directions for future work.

Abstract

Diffusion models have seen immense success in modelling continuous data across a range of domains such as vision and audio. Despite the challenges of adapting diffusion models to discrete data, recent work explores their application to text generation by working in the continuous embedding space. However, these models lack a natural means to control the inherent trade-off between quality and diversity as afforded by the temperature hyperparameter in autoregressive models, hindering understanding of model performance and restricting generation quality. This work proposes the use of classifier-free guidance and stochastic clamping for manipulating the quality-diversity trade-off on sequence-to-sequence tasks, demonstrating that these techniques may be used to improve the performance of a diffusion language model.
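
As a rough illustration of the classifier-free guidance mentioned above, the sketch below blends a conditional and an unconditional model prediction using a guidance scale. It assumes the standard classifier-free guidance formulation, in which a scale of 1 recovers the ordinary conditional prediction; the function name `cfg_combine` and the toy vectors are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Blend two predictions as uncond + scale * (cond - uncond).

    Assumption: this is the standard classifier-free guidance rule;
    the paper's exact parameterisation may differ.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions standing in for one embedding vector (hypothetical values).
cond = [1.0, 2.0, 3.0]     # prediction given the source sequence
uncond = [0.5, 1.0, 1.5]   # prediction with the conditioning dropped

print(cfg_combine(cond, uncond, 1.0))  # scale 1: the conditional prediction
print(cfg_combine(cond, uncond, 2.5))  # scale > 1: pushed further from uncond
```

Under this formulation, scales above 1 extrapolate away from the unconditional prediction, while scales below 1 move the result back towards it.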

Paper Structure

This paper contains 22 sections, 13 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Summary of the proposed method: an embedding-space sequence-to-sequence diffusion model with a transformer encoder-decoder backbone, augmented with classifier-free guidance. The diffusion process generates a sequence of embeddings $\hat{{\bf y}}_0$ which are clamped to the nearest tokens.
  • Figure 2: Quality-diversity trade-off as the temperature $\tau$ is varied (lower is better for both metrics). Quality rapidly drops for $\tau>1$. Self-BLEU is measured over 5 seeds, and error bars denote standard deviation in quality. Note that the $\tau=0$ points correspond to the usual sampling procedures with and without the clamping trick.
  • Figure 3: Effect of classifier-free guidance scale on quality (BLEU-4 / ROUGE-L). Quality drops sharply for $s<0.75$; these values are omitted for clarity. Note that $s=1$ corresponds to the usual sampling procedure.
  • Figure 4: Quality-diversity trade-off as the classifier-free guidance scale is varied. 5 seeds are used for self-BLEU and quality standard deviation error bars.
  • Figure 5: Distance of predicted embeddings $\hat{{\bf y}}_0$ (following CFG) to closest true embedding during 20-step inference, averaged over evaluation on the entire test set. Note the logarithmic scale on the y-axis due to significant differences in scale. 'Baseline' refers to the usual inference procedure without clamping or CFG. The strength values of $s=2.5$ and $s=4.0$ were chosen to maximise BLEU score.
  • ...and 3 more figures
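
The clamping step in Figure 1 and the temperature $\tau$ in Figure 2 can be illustrated with a small sketch. With $\tau=0$ the function below snaps a predicted embedding to its nearest token embedding, as the Figure 1 caption describes. For $\tau>0$ it samples a token with probability proportional to $\exp(-d/\tau)$, where $d$ is the squared distance to each token embedding. That sampling rule is an assumption made purely for illustration; the captions above do not define stochastic clamping, and the paper's definition may differ. The names `clamp` and `squared_distance` and the toy vocabulary are hypothetical.

```python
import math
import random
from typing import Optional, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clamp(pred: Sequence[float],
          vocab: Sequence[Sequence[float]],
          tau: float = 0.0,
          rng: Optional[random.Random] = None) -> int:
    """Return the index of the token embedding chosen for `pred`.

    tau == 0: deterministic nearest-neighbour clamping.
    tau > 0:  sample from softmax(-distance / tau)  (assumed rule).
    """
    dists = [squared_distance(pred, emb) for emb in vocab]
    if tau <= 0.0:
        return min(range(len(vocab)), key=dists.__getitem__)
    d_min = min(dists)  # shift distances so the largest weight is exp(0) = 1
    weights = [math.exp(-(d - d_min) / tau) for d in dists]
    sampler = rng if rng is not None else random
    return sampler.choices(range(len(vocab)), weights=weights, k=1)[0]


# Toy 2-D "vocabulary" of three token embeddings (hypothetical values).
vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
pred = [0.9, 0.1]

print(clamp(pred, vocab))                                   # nearest token index
print(clamp(pred, vocab, tau=0.5, rng=random.Random(0)))    # sampled token index
```

In this sketch a larger `tau` flattens the sampling distribution, so tokens further from the prediction are chosen more often; this is one plausible way for a temperature to trade quality for diversity.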