Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models

Xiaochen Zhu, Georgi Karadzhov, Chenxi Whitehouse, Andreas Vlachos

TL;DR

Experiments demonstrate that, compared with other diffusion and autoregressive baselines, SLD achieves competitive or superior fluency, coherence, and contextual compatibility in automatic and human evaluations.

Abstract

Diffusion models have shown promise in text generation, but they often struggle to generate long, coherent, and contextually accurate text. Token-level diffusion does not model word-order dependencies explicitly and operates on short, fixed output windows, while passage-level diffusion struggles to learn robust representations for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into multiple latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on four datasets demonstrate that, compared with other diffusion and autoregressive baselines, SLD achieves competitive or superior fluency, coherence, and contextual compatibility in automatic and human evaluations.
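
The segmentation idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the punctuation-based splitting rule, the function names `split_into_segments` and `encode_segments`, and the `toy_encoder` stand-in (a trained encoder would produce learned latent vectors instead). The sketch only shows the data flow in which one long output becomes several segments, each with its own latent representation.

```python
import re
from typing import Callable, List, Sequence


def split_into_segments(text: str) -> List[str]:
    """Split text into sentence-like segments.

    Assumption: a segment ends at '.', '!' or '?' followed by whitespace.
    The actual segmentation unit (sentence, utterance, ...) may differ.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def encode_segments(
    segments: Sequence[str],
    encode_one: Callable[[str], List[float]],
) -> List[List[float]]:
    """Map each segment to its own latent vector (one latent per segment)."""
    return [encode_one(segment) for segment in segments]


def toy_encoder(segment: str) -> List[float]:
    """Hypothetical stand-in for a learned encoder.

    Returns [character count, word count] so the example runs without a model.
    """
    return [float(len(segment)), float(len(segment.split()))]


passage = "Diffusion models generate text. Long outputs are hard! Segments help."
segments = split_into_segments(passage)
latents = encode_segments(segments, toy_encoder)

for segment, latent in zip(segments, latents):
    print(latent, segment)
```

In a full system along these lines, a diffusion model would predict the sequence of segment latents from the input, and an autoregressive decoder would turn each latent back into text; those components are omitted here.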

Paper Structure

This paper contains 19 sections, 27 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Comparison of AR models (top), latent diffusion (middle), and our segment-level diffusion (bottom). Unlike latent diffusion, which de-noises a single latent representation, our method splits outputs and representation into segments as the cross-attention target for conditional generation with parallel autoregressive decoding, improving text quality and controllability.
  • Figure 2: Overview of the training pipeline of SLD. In the first stage, gold output is divided into segments. In the second stage, we use contrastive and adversarial learning to ensure latent representations are robust to drastic semantic changes. Finally, we train a diffusion model as an inherent semantic planner conditioned on given inputs.
  • Figure 3: BLEU scores of different auto-encoder/decoder models for text conversion of a single utterance on the DialogSum dataset.
  • Figure 4: Violin plot of the performance gain distribution of the DeliData dialogue continuations. The distribution of our model is closer to the gold distribution, indicating that its output is better controlled.
  • Figure 5: An example of the human evaluation interface for DialogSum. Each session starts with comprehensive instructions and examples, followed by model outputs and questions.
  • ...and 1 more figure