A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models

Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu

TL;DR

This survey unifies a broad range of parallel text generation techniques and clarifies the distinction between autoregressive-compatible approaches and truly non-autoregressive paradigms. Its taxonomy groups AR-based methods into Draft-and-Verify, Decomposition-and-Fill, and Multi-Token Prediction, and Non-AR-based methods into One-shot, Masked, and Edit-Based strategies, with diffusion-based models occupying a prominent role in the Non-AR landscape. The paper synthesizes theoretical trade-offs in speed, quality, and resources, and discusses combinations and system-level accelerations that can yield substantial throughput gains while maintaining acceptable output quality. It also identifies open challenges, including the persistent quality-speed trade-off, the need for task-specific benchmarks, and the lack of modular tooling for scalable deployment. Overall, the work highlights practical directions for building faster, more efficient text generation systems that can operate at real-time scales and on resource-constrained hardware.

Abstract

As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time conditioned on the previously generated context. Because this process is inherently sequential, generation speed is limited. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis of which specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation. We have also created a GitHub repository that indexes relevant papers and open resources, available at https://github.com/zhanglingzhe0820/Awesome-Parallel-Text-Generation.

Paper Structure

This paper contains 108 sections, 54 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Analysis of Parallel Text Generation Compared with Traditional Large Language Models
  • Figure 2: Taxonomy of parallel text generation methods
  • Figure 3: Comparison of draft-and-verify with autoregressive decoding. Autoregressive decoding generates tokens one at a time, which limits decoding speed. In contrast, draft-and-verify methods such as speculative decoding use a more efficient model as a drafter to rapidly propose tokens, which the target model then verifies. Tokens that pass verification are accepted and the rest are discarded, which yields a form of parallelized generation.
  • Figure 4: Taxonomy of draft-and-verify methods
  • Figure 5: Taxonomy of Multi-Token Prediction Methods
  • ...and 6 more figures
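
The draft-and-verify idea described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's or any library's implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for real language models, decoding is greedy, and a drafted token is accepted only if it matches the target's own greedy choice. Real speculative decoding typically uses a probabilistic acceptance rule for sampling, and the verification loop below would be a single batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token (toy assumption).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical large model: an arbitrary deterministic rule.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical cheap drafter: agrees with the target except when the
    # prefix length is a multiple of 4, where it is deliberately wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def autoregressive_decode(model: Model, prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    # Baseline: one sequential model step per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens, n_new


def draft_and_verify(target: Model, draft: Model, prompt: List[int],
                     n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0  # number of verification rounds by the target model
    while len(tokens) < goal:
        # 1) Draft: the cheap model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) Verify: the target checks every drafted position. In a real
        #    system these checks share one parallel forward pass.
        passes += 1
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix, then the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens, passes


if __name__ == "__main__":
    ar_tokens, ar_steps = autoregressive_decode(target_model, [1, 2, 3], 12)
    dv_tokens, dv_passes = draft_and_verify(target_model, draft_model, [1, 2, 3], 12)
    print(dv_tokens == ar_tokens, ar_steps, dv_passes)
```

Because each accepted token equals the target's greedy choice, the output matches plain autoregressive decoding with the target model, while the number of target verification rounds is smaller whenever the drafter is right often enough.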