Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No

Abstract

Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs.
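
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the threshold `tau`, the symmetrization rule (an edge exists if attention in either direction exceeds `tau`), the greedy ordering by confidence, and the function name `select_independent_set` are all assumptions made for illustration. The abstract states only that attention induces a dependency graph over masked tokens and that an independent set is unmasked in parallel.

```python
from typing import List, Sequence


def select_independent_set(
    attn: Sequence[Sequence[float]],
    confidence: Sequence[float],
    tau: float,
) -> List[int]:
    """Greedily pick masked positions that are pairwise weakly coupled.

    attn[i][j]   : attention weight from masked position i to masked position j
                   (hypothetical input; aggregation over heads/layers is assumed
                   to have happened already).
    confidence[i]: model confidence for its top prediction at position i.
    tau          : hypothetical edge threshold; larger values give fewer edges.
    """
    n = len(confidence)

    def coupled(i: int, j: int) -> bool:
        # Assumed edge rule: strong attention in either direction.
        return max(attn[i][j], attn[j][i]) > tau

    # Visit positions from most to least confident (an assumed heuristic).
    order = sorted(range(n), key=lambda i: -confidence[i])

    chosen: List[int] = []
    for i in order:
        # Keep i only if it has no edge to any already chosen position.
        if all(not coupled(i, j) for j in chosen):
            chosen.append(i)
    return sorted(chosen)


# Toy example: positions 0-1 and 2-3 attend strongly to each other.
attn = [
    [0.00, 0.60, 0.05, 0.05],
    [0.50, 0.00, 0.10, 0.05],
    [0.05, 0.05, 0.00, 0.70],
    [0.10, 0.05, 0.40, 0.00],
]
confidence = [0.9, 0.8, 0.6, 0.7]
selected = select_independent_set(attn, confidence, tau=0.3)
print(selected)  # [0, 3]
```

In this toy case one position is taken from each strongly coupled pair, so the two tokens unmasked together are only weakly dependent under the assumed edge rule.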

Paper Structure (41 sections, 10 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 2: Illustration of synthetic MRF dataset.
  • Figure 3: Accuracy–Steps trade-off of decoding strategies on two dLLMs. Top: LLaDA. Bottom: Dream. Left: code tasks (HumanEval, MBPP). Middle: math tasks (GSM8K, Math500). Right: instruction-following task (IFEval). Markers denote decoding strategies. For LLaDA, baselines are shown with block decoding and EOS suppression settings ("4 Blocks" and "EOS-Inf"), while DAPD is evaluated in the single-block regime by default. For Dream, all methods are evaluated in the single-block regime. Colored lines indicate different tasks.
  • Figure 4: Score–Steps trade-off of different decoding strategies on ParallelBench using LLaDA. Colors denote different tasks.
  • Figure 5: Left: Distribution of tokens unmasked during the initial 40% of decoding. Progress is normalized by total steps per sample and method. Heatmaps show the average unmasking trajectory, where lighter colors indicate earlier unmasking and white regions denote tokens remaining masked. Vertical color bars (labeled above) indicate average starting positions for the five questions; FD denotes Fast-dLLM. Right: Average number of isolated segments per decoding step, characterizing sequence fragmentation. Notably, DAPD exhibits a distinct unmasking pattern, whereas baselines show highly similar behaviors.
  • Figure 6: Analysis of $\tau_{\min}$. Left: LLaDA. Right: Dream.
  • ...and 11 more figures

Theorems & Definitions (2)

  • Definition 3.1: Markov Random Field
  • Remark 3.2