Energy-Based Diffusion Language Models for Text Generation

Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, Arash Vahdat

TL;DR

This work tackles the performance gap of discrete diffusion models for text by introducing Energy-based Diffusion Language Model (EDLM), a residual energy-based denoiser that operates on full sequences at each diffusion step. The energy function can be instantiated from a pretrained autoregressive model or learned via noise-contrastive estimation, enabling efficient parallel sampling through importance sampling. Empirical results on Text8 and OpenWebText show that EDLM consistently surpasses prior diffusion baselines and approaches autoregressive perplexities, while achieving up to 1.3x faster sampling. Overall, EDLM provides a principled path to high-quality, fast parallel text generation by integrating energy-based denoising with discrete diffusion.
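The noise-contrastive estimation route mentioned above can be illustrated with a small sketch. It assumes the common binary form of NCE, in which real sequences are positives, samples from the diffusion denoiser are negatives, and the residual energy acts as the classifier logit (low energy for real data). The function names `log_sigmoid` and `nce_loss` are hypothetical, and the energies are passed in as plain floats rather than computed by a transformer.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def nce_loss(pos_energies, neg_energies):
    """Binary NCE loss for a residual energy function (assumed form).

    pos_energies: energies of real sequences (should become low).
    neg_energies: energies of denoiser samples (should become high).
    """
    # Positives: maximize log sigmoid(-E), i.e. push energy down.
    pos = -sum(log_sigmoid(-e) for e in pos_energies) / len(pos_energies)
    # Negatives: maximize log sigmoid(+E), i.e. push energy up.
    neg = -sum(log_sigmoid(e) for e in neg_energies) / len(neg_energies)
    return pos + neg


# An energy that separates the two sets gives a lower loss than one that confuses them.
good = nce_loss([-5.0, -4.0], [5.0, 4.0])
bad = nce_loss([5.0, 4.0], [-5.0, -4.0])
print(good < bad)
```

In a real training loop the energies would come from the bidirectional transformer and the loss would be backpropagated; the sketch only shows the shape of the objective.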

Abstract

Despite remarkable progress in autoregressive language models, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with the capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform their autoregressive counterparts, and the performance gap widens as the number of sampling steps is reduced. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce an EBM in a residual form, and show that its parameters can be obtained by leveraging a pretrained autoregressive model or by finetuning a bidirectional transformer via noise contrastive estimation. We also propose an efficient generation algorithm via parallel importance sampling. Comprehensive experiments on language modeling benchmarks show that our model consistently outperforms state-of-the-art diffusion models by a significant margin and approaches the perplexity of autoregressive models. We further show that, without any drop in generation performance, our framework offers a 1.3$\times$ sampling speedup over existing diffusion models. Reproduced code is available at https://github.com/MinkaiXu/Energy-Diffusion-LLM.
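The importance sampling step can be sketched as self-normalized resampling: several candidate clean sequences are drawn from the diffusion denoiser, each is scored by the residual energy, and one is kept with probability proportional to $\exp(-E)$. This is a minimal sketch under those assumptions, not the authors' implementation; `importance_resample` is a hypothetical name, and the candidates and energies are supplied directly instead of being produced by models.

```python
import math
import random


def importance_resample(candidates, energies, rng=random):
    """Pick one candidate with probability proportional to exp(-energy).

    candidates: proposals drawn from the denoiser (assumed given).
    energies:   residual energy of each candidate (assumed given).
    """
    # Subtract the minimum energy so the largest weight is exactly 1.
    m = min(energies)
    weights = [math.exp(-(e - m)) for e in energies]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(candidates, weights=probs, k=1)[0]
    return choice, probs


rng = random.Random(0)
cands = ["the cat sat", "cat the sat", "sat cat the"]
choice, probs = importance_resample(cands, [0.1, 2.0, 3.5], rng)
print(choice, [round(p, 3) for p in probs])
```

Because the candidates are independent, both the proposal draws and the energy evaluations can be batched, which is what makes the procedure parallel.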

Paper Structure

This paper contains 24 sections, 2 theorems, 18 equations, 5 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Given diffused data ${\mathbf{x}}_t$ at timestep $t$, let $\log {\mathcal{Z}}_n$ denote the empirical estimation of $\log Z_\phi({\mathbf{x}}_t) = \log \mathbb{E}_{{\mathbf{x}}_0 \sim p_\theta} \exp(-{\bm{E}}_\phi({\mathbf{x}}_0,{\mathbf{x}}_t))$ with $n$ samples ${\mathbf{x}}_0^{(i)}\sim p_\theta (
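The empirical estimate named in Theorem 1 can be illustrated with a small sketch. It assumes $\log {\mathcal{Z}}_n$ is the log of the plain Monte Carlo average, $\log \frac{1}{n}\sum_{i=1}^{n} \exp(-{\bm{E}}_\phi({\mathbf{x}}_0^{(i)},{\mathbf{x}}_t))$, which the statement here does not spell out. The function name `log_partition_estimate` is hypothetical, and the energies of the $n$ samples are passed in as floats.

```python
import math


def log_partition_estimate(energies):
    """Return log((1/n) * sum_i exp(-E_i)) using the log-sum-exp trick.

    energies: residual energies E(x0_i, x_t) of n samples drawn from the
              denoiser p_theta (assumed to be given).
    """
    n = len(energies)
    # Shift by the largest exponent so no term overflows.
    m = max(-e for e in energies)
    return m + math.log(sum(math.exp(-e - m) for e in energies)) - math.log(n)


# With identical energies c, the estimate is exactly -c up to rounding.
print(log_partition_estimate([2.0, 2.0, 2.0]))
```

By Jensen's inequality, the expectation of this log-of-average estimate is at most the true $\log Z_\phi({\mathbf{x}}_t)$, so for finite $n$ it tends to underestimate the log-partition function.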

Figures (5)

  • Figure 1: Analysis and ablation study for EDLM. Generative perplexity vs. time and entropy vs. time panels: we run the AR baseline and diffusion-based models with $[512, 768, 1024]$ denoising steps and plot each metric against wall-clock time. Generative perplexity is evaluated by GPT-2, and a curve closer to the bottom-left indicates a better trade-off between sampling quality and time. Importance sampling ablation panel: ablation study of EDLM under different importance sampling sizes and window lengths.
  • Figure 2: Behavior of the energy function under different parameterizations. The first row plots the energy of positive samples and the average energy of negative samples; the second row plots the maximum and minimum energy values of the 16 negative samples; the third row plots the effective sample size (ESS) for the energies of the 16 negative samples. Each column corresponds to a different EDLM parameterization.
  • Figure 3: Additional EDLM sampling results with varying importance sampling window ${\mathbf{w}}$. As in the ablation experiments, we run each setting with $[512, 768, 1024]$ denoising steps and plot the corresponding metric against wall-clock time. Generative perplexity is evaluated by GPT-2, and a curve closer to the bottom-left indicates a better trade-off between sampling quality and time.
  • Figure 4: Gen. PPL vs. Entropy.
  • Figure (unnumbered): Denoising via Importance Sampling
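The effective sample size reported in Figure 2 can be computed from the energies of a batch of samples. The sketch below assumes the standard definition $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$ with unnormalized weights $w_i = \exp(-E_i)$; the page does not state which definition the authors use, and `effective_sample_size` is a hypothetical name.

```python
import math


def effective_sample_size(energies):
    """ESS of importance weights w_i = exp(-E_i) (standard definition, assumed)."""
    # Shifting all energies by a constant leaves the ESS unchanged
    # and keeps the exponentials in a safe numeric range.
    m = min(energies)
    w = [math.exp(-(e - m)) for e in energies]
    s = sum(w)
    return s * s / sum(x * x for x in w)


# 16 samples with equal energy: every sample counts fully.
print(effective_sample_size([1.0] * 16))
# One sample with much lower energy dominates: ESS collapses toward 1.
print(effective_sample_size([0.0] + [50.0] * 15))
```

An ESS close to the number of samples means the energy reweights the proposals only mildly, while an ESS near 1 means a single candidate carries almost all the weight.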

Theorems & Definitions (3)

  • Theorem 1
  • Theorem (partition estimation)
  • Proof