A Survey on Diffusion Language Models

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

TL;DR

Diffusion Language Models (DLMs) offer a parallelizable alternative to autoregressive LLMs by denoising over either continuous embeddings or discrete tokens, enabling bidirectional context and faster inference. The survey presents a comprehensive taxonomy of continuous, discrete, and hybrid AR–diffusion paradigms, reviews training (pre-training and post-training) and inference strategies (parallel decoding, unmasking/remasking, guidance, caching, step distillation), and surveys multimodal and unified diffusion models. It documents performance trends, downstream applications across NLP, code, biology, and robotics, and analyzes key challenges such as parallelism trade-offs, infrastructure, long-context handling, and scalability, while outlining future directions. Overall, the work establishes a structured framework for understanding DLMs, highlights practical gains in efficiency and controllability, and points to avenues (e.g., agent-based reasoning, low-bit deployment, and cross-modal integration) where diffusion-based approaches may surpass traditional autoregressive methods in real-world settings.

Abstract

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. Recent advances have allowed DLMs to reach performance comparable to that of their autoregressive counterparts while achieving a several-fold speed-up, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. The project's GitHub repository is available at https://github.com/VILA-Lab/Awesome-DLMs.
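
To make "generating tokens in parallel through an iterative denoising process" concrete, the following is a minimal sketch of confidence-based parallel unmasking for a discrete (masked) DLM. It is an illustration under stated assumptions, not the procedure of any specific model in the survey: `parallel_unmask`, `toy_predict`, `TARGET`, and `CONF` are hypothetical names, the "model" is a fixed lookup table rather than a network, and remasking of already committed tokens is omitted for brevity.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical placeholder for a still-masked position
Prediction = Tuple[str, float]  # (proposed token, confidence in [0, 1])


def parallel_unmask(
    length: int,
    predict: Callable[[List[Optional[str]]], List[Prediction]],
    steps: int,
) -> List[List[Optional[str]]]:
    """Run a toy unmasking loop and return the sequence after every step."""
    seq: List[Optional[str]] = [MASK] * length
    history: List[List[Optional[str]]] = []
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        # One proposal per position, all produced in a single "forward pass".
        preds = predict(seq)
        # Commit enough tokens per step to finish within the step budget.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Keep the k most confident proposals; the rest stay masked.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return history


# Toy stand-in for a denoiser (an assumption for illustration only):
# it always proposes the same token and confidence for each position.
TARGET = ["diffusion", "models", "decode", "in", "parallel", "steps"]
CONF = [0.9, 0.4, 0.7, 0.2, 0.8, 0.5]


def toy_predict(seq: List[Optional[str]]) -> List[Prediction]:
    return list(zip(TARGET, CONF))


if __name__ == "__main__":
    for state in parallel_unmask(len(TARGET), toy_predict, steps=3):
        print(state)
```

In this sketch, six tokens are produced in three steps, two per step, whereas a strictly left-to-right decoder would need six sequential steps. The trade-off between the number of tokens committed per step and output quality is one of the challenges the survey discusses.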

Paper Structure

This paper contains 31 sections, 14 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Timeline of Diffusion Language Models. This figure highlights key milestones in the development of DLMs, categorized into three groups: continuous DLMs, discrete DLMs, and recent multimodal DLMs. We observe that while early research predominantly focused on continuous DLMs, discrete DLMs have gained increasing popularity in more recent years.
  • Figure 2: Trend of diffusion language model papers. For discrete DLMs, the statistics are drawn from papers citing D3PM (Austin et al., 2021), with a further selection of those whose titles or abstracts include the keyword "language". For continuous DLMs, the statistics are based on the number of related studies documented in the repository associated with this paper. The results reflect a growing research interest in this domain. The statistics are for reference only.
  • Figure 3: A taxonomy of Diffusion Language Models, covering foundations, training and inference strategies, and key applications. The section numbers (§) correspond to the sections in this survey.
  • Figure 4: An overview of training and inference procedures across different paradigms of Diffusion Language Models, with autoregressive (AR) models included for comparison. AR models are trained using teacher forcing and causal attention, whereas both discrete and continuous DLMs employ fully bidirectional attention mechanisms. Block-wise diffusion models, exemplified by BD3-LM (Arriola et al., 2025), integrate autoregressive and diffusion strategies, and are trained using a specially designed block-causal attention mask.
  • Figure 5: Inference Techniques of Diffusion Language Models. We illustrate six different strategies here, including: (a) Parallel Decoding; (b) Unmasking & Remasking; (c) Classifier-free Guidance; (d) Key-Value Cache; (e) Feature Cache; and (f) Step Distillation.
  • ...and 2 more figures
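
The attention patterns contrasted in the Figure 4 caption can be sketched with a single helper. This is a minimal illustration of the basic block-causal pattern only, assuming a plain boolean mask over one sequence; it is not the full training mask used by BD3-LM, and `block_causal_mask` is a hypothetical name introduced here.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Return mask[i][j], True when query position i may attend to key j.

    Positions see every token in their own block (bidirectional within
    the block) and every token in earlier blocks (causal across blocks).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for row in block_causal_mask(seq_len=4, block_size=2):
        print(["x" if allowed else "." for allowed in row])
```

Under these assumptions, `block_size=1` reduces to the causal mask of AR models, and `block_size=seq_len` gives the fully bidirectional attention of discrete and continuous DLMs, which is why block-wise diffusion can be read as interpolating between the two.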