Discrete Diffusion Language Model for Efficient Text Summarization

Do Huu Dat, Do Duc Anh, Anh Tuan Luu, Wray Buntine

TL;DR

This work tackles the challenge of applying discrete diffusion models to conditional long-text generation by introducing a semantic-aware forward noising process and a CrossMamba encoder-decoder backbone. By combining attention-informed noising for Transformer backbones with a state-space Mamba framework, the method enables efficient long-sequence summarization with linear-time processing and competitive or superior ROUGE and BERTScore results on Gigaword, CNN/DailyMail, and Arxiv. It achieves state-of-the-art performance among discrete diffusion approaches while decoding significantly faster than autoregressive baselines. The combination of semantic-aware conditioning and CrossMamba is of practical value for long-context text generation, though a gap to autoregressive models remains and scaling to extremely long sequences is left for future work.

Abstract

While diffusion models excel at conditionally generating high-quality images, prior work on discrete diffusion models has not been evaluated on conditional long-text generation. In this work, we address the limitations of prior discrete diffusion models for conditional long-text generation, particularly in long sequence-to-sequence tasks such as abstractive summarization. Despite decoding faster than autoregressive methods, previous diffusion models failed on the abstractive summarization task due to the incompatibility between the backbone architectures and the random noising process. To overcome these challenges, we introduce a novel semantic-aware noising process that enables Transformer backbones to handle long sequences effectively. Additionally, we propose CrossMamba, an adaptation of the Mamba model to the encoder-decoder paradigm, which integrates seamlessly with the random absorbing noising process. Our approaches achieve state-of-the-art performance on three benchmark summarization datasets (Gigaword, CNN/DailyMail, and Arxiv), outperforming existing discrete diffusion models on ROUGE metrics while running much faster at inference than autoregressive models.

Paper Structure (24 sections, 11 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: In contrast to conventional discrete diffusion models, we feed the full target sequence through the encoder to obtain attention scores, which reflect the relative importance of each token to the overall semantic meaning of the target sentence, and use those scores to alter the absorbing probability. The higher a token's attention score, the lower the probability that it is absorbed into the [MASK] token, denoted [M].
  • Figure 2: The model consists of an encoder and a decoder. The encoder processes the input sequence ($source$), while the decoder handles the noisy target sequence. Time step information is incorporated by adding time step embeddings $t$. The semantic-aware pipeline is illustrated by the blue dashes. A [CLS] token $C$ is appended to both the source and target sequences, which are then passed through the encoder. The similarity loss $L_{cls}$ is computed using the two corresponding [CLS] tokens $C_s$ and $C_t$ (detached). Additionally, the attention scores $a$ from the target sequence are used in the noising process. The decoder can consist of standard Transformer blocks that incorporate conditioning via cross-attention, or of CrossMamba blocks that integrate conditioning through bidirectional CrossMamba.
  • Figure 3: Curves of BLEU score vs training steps on the QQP dataset with absorbing noising and semantic-aware noising.
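
The idea in the Figure 1 caption, that tokens with higher attention scores are less likely to be absorbed into [M], can be sketched as follows. This is only an illustration: the rule `p_i = (t/T) ** (a_i / mean(a))`, the function names `absorb_probs` and `noise`, and the example scores are all assumptions made here, not the paper's actual formulation, and the attention scores are assumed to be strictly positive.

```python
import random

MASK = "[M]"  # stands in for the absorbing [MASK] token


def absorb_probs(attn, t, T):
    """Per-token absorbing probability at step t of T.

    Assumed rule (hypothetical, not taken from the paper):
        p_i = (t / T) ** (a_i / mean(a))
    A token with above-average attention gets an exponent > 1, so its
    probability falls below the uniform rate t / T; a token with
    below-average attention is absorbed more readily. Assumes a_i > 0.
    """
    base = t / T
    mean_a = sum(attn) / len(attn)
    return [base ** (a / mean_a) for a in attn]


def noise(tokens, attn, t, T, rng):
    """Replace each token with MASK independently with its own probability."""
    probs = absorb_probs(attn, t, T)
    return [MASK if rng.random() < p else tok for tok, p in zip(tokens, probs)]


if __name__ == "__main__":
    tokens = ["stocks", "fell", "sharply"]
    attn = [0.1, 0.6, 0.3]  # made-up attention scores for illustration
    rng = random.Random(0)
    print(absorb_probs(attn, t=5, T=10))
    print(noise(tokens, attn, t=5, T=10, rng=rng))
```

Under this assumed rule the endpoints match plain absorbing noise: nothing is masked at `t = 0` and every token is masked at `t = T`; only the order in which tokens tend to disappear in between depends on the attention scores.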