Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation

Chengze Li, Yitong Zhang, Jia Li, Liyi Cai, Ge Li

TL;DR

This study examines diffusion large language models as an alternative to autoregressive (AR) code generation, motivated by efficiency and non-sequential programming workflows. It conducts the first comprehensive empirical evaluation of nine diffusion LLMs against four autoregressive baselines across multiple benchmarks (HumanEval, MBPP, LiveCodeBench, RepoQA) and analyzes how generation length, diffusion steps, remasking, block diffusion, and temperature affect accuracy and efficiency. Key findings show that diffusion LLMs are competitive with AR models of similar size, exhibit superior length extrapolation for long code understanding, and offer substantial efficiency gains under appropriate settings, while still facing an overall gap to top AR performance. The work provides practical guidance for deploying diffusion LLMs in code generation, identifies their complementary strengths with AR models, and proposes future directions including hybrid architectures, structure-aware remasking, and optimized caching to advance real-world applicability.

Abstract

LLMs have become the mainstream approach to code generation. Existing LLMs mainly employ autoregressive generation, i.e., generating code token by token from left to right. However, autoregressive generation has two limitations in code generation. First, autoregressive LLMs generate only one token at each step, which makes them inefficient in practice. Second, programming is a non-sequential process involving back-and-forth editing, whereas autoregressive LLMs follow only a left-to-right generation order. These two intrinsic limitations hinder the further development of LLMs in code generation. Recently, diffusion LLMs have emerged as a promising alternative. Diffusion LLMs address the above limitations with two advances: multi-token prediction (i.e., generating multiple tokens at each step) and flexible generation order (i.e., flexibly determining the positions at which tokens are generated). However, there is no systematic study exploring diffusion LLMs in code generation. To bridge this knowledge gap, we present the first empirical study of diffusion LLMs for code generation. Our study involves 9 representative diffusion LLMs and conducts experiments on 4 widely used benchmarks. Based on the results, we summarize the following findings. (1) Existing diffusion LLMs are competitive with autoregressive LLMs of similar size. (2) Diffusion LLMs have a stronger length extrapolation ability than autoregressive LLMs and perform better in long code understanding. (3) We explore factors impacting the effectiveness and efficiency of diffusion LLMs, and provide practical guidance. (4) We discuss several promising future directions to improve diffusion LLMs for code generation. We open-source all source code, data, and results to facilitate follow-up research. The code is publicly available at https://github.com/zhangyitonggg/dllm4code.
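
To make the two advances named in the abstract concrete, the following is a minimal toy sketch, not the decoding procedure of any model in the study. It contrasts left-to-right decoding with a masked-diffusion-style loop that fills several positions per step, choosing positions by confidence. The function names (`toy_predict`, `diffusion_decode`, `autoregressive_decode`), the `tokens_per_step` parameter, and the random confidence scores are all assumptions made for illustration; a real diffusion LLM would replace `toy_predict` with a model forward pass.

```python
import random

MASK = None  # marks a position that has not been generated yet


def toy_predict(seq, target, rng):
    """Hypothetical stand-in for a diffusion LLM forward pass.

    Returns {position: (token, confidence)} for every masked position.
    The "prediction" is simply the target token with a random confidence.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok is MASK}


def diffusion_decode(target, tokens_per_step, seed=0):
    """Fill up to `tokens_per_step` masked positions per step, most confident first."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    order = []   # positions in the order they were filled
    steps = 0
    while any(tok is MASK for tok in seq):
        preds = toy_predict(seq, target, rng)
        # Multi-token prediction: commit several positions in one step.
        # Flexible order: positions are picked by confidence, not left to right.
        chosen = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in chosen[:tokens_per_step]:
            seq[i] = preds[i][0]
            order.append(i)
        steps += 1
    return seq, steps, order


def autoregressive_decode(target):
    """Left-to-right baseline: exactly one token per step."""
    seq = []
    for tok in target:
        seq.append(tok)
    return seq, len(target)


if __name__ == "__main__":
    tokens = "def add ( a , b ) : return a + b".split()
    _, ar_steps = autoregressive_decode(tokens)
    _, dllm_steps, fill_order = diffusion_decode(tokens, tokens_per_step=4)
    print("AR steps:", ar_steps)
    print("Diffusion steps:", dllm_steps)
    print("Fill order:", fill_order)
```

In this toy setting the 12-token snippet takes 12 autoregressive steps but only 3 diffusion steps at 4 tokens per step, and the fill order is generally not left to right. It says nothing about accuracy, which in the study depends on factors such as the number of diffusion steps and the remasking strategy.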

Paper Structure

This paper contains 31 sections, 2 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Generation paradigms of AR LLMs and diffusion LLMs.
  • Figure 2: An example commit illustrating non-sequential programming.
  • Figure 3: Overview of our study. DLLM is used as an abbreviation for Diffusion LLM.
  • Figure 4: Venn diagram of tasks solved by diffusion and AR LLMs across HumanEval, MBPP, and LiveCodeBench v1-v6.
  • Figure 5: Performance trajectory of representative diffusion LLMs and AR LLMs on HumanEval, measured by pass@1.
  • ...and 11 more figures