Decoding Large Language Diffusion Models with Foreseeing Movement

Yichuan Mo, Quan Chen, Mingjie Li, Zeming Wei, Yisen Wang

TL;DR

Problem: decoding-order sensitivity in LLDMs undermines performance. Approach: Foreseeing Decoding Method (FDM) uses both local confidence and global future impact via a discrete beam-search-like strategy; FDM-A adds an adaptive acceleration mechanism by exploiting consistency and phase-based exploration. Theoretical contribution: proves that FDM reduces KL divergence to the data distribution compared with heuristic decoding, with a bound expressed via mutual information. Empirical findings: across GSM8K, ARC, HumanEval, Countdown, and multiple LLDM variants, FDM consistently outperforms baselines, and FDM-A achieves a strong speed-accuracy trade-off, validating its practicality as an inference-time scaling method.

Abstract

Large Language Diffusion Models (LLDMs) benefit from a flexible decoding mechanism that enables parallelized inference and more controllable generation than autoregressive models. Yet this flexibility introduces a critical challenge: inference performance becomes highly sensitive to the order in which tokens are decoded. Existing heuristic methods focus mainly on local effects and overlook long-term impacts. To address this limitation, we propose the Foreseeing Decoding Method (FDM), a novel approach that integrates both local and global considerations to unlock the full potential of LLDMs, employing a search-based strategy to enable effective optimization in discrete spaces. Furthermore, by analyzing the consistency of chosen tokens over the full decoding process, we develop a variant, FDM with Acceleration (FDM-A), which restricts deep exploration to critical steps identified as the exploration and balance circumstances. Extensive experiments across diverse benchmarks and model architectures validate the scalability of FDM and demonstrate the superior efficiency-performance trade-off achieved by FDM-A. Our work may provide a principled step toward more powerful decoding methods for LLDMs.

Paper Structure

This paper contains 15 sections, 1 theorem, 45 equations, 19 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Let $\Delta_{\text{total}}\triangleq\sum_{t=1}^{T}\mathbb{E}_{p_{data}(\mathbf{x}_{t-1})}\bigl[\mathcal{I}_{p_{data}}(\mathbf{x}_t;\mathbf{x}_{T}|\mathbf{x}_{t-1})\bigr]$, where $\mathcal{I}_{p_{data}}(\mathbf{x}_t;\mathbf{x}_{T}|\mathbf{x}_{t-1})$ is the conditional mutual information under $p_{data}$.
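
For readers who want the information term in $\Delta_{\text{total}}$ spelled out, the block below expands it under the standard definition of conditional mutual information. This is an assumption about the convention in use, since the page shows only the opening definition of the theorem and not the paper's own expansion.

```latex
% Assumed standard definition, evaluated at a fixed value of x_{t-1};
% the outer expectation in \Delta_total then averages over x_{t-1}.
\mathcal{I}_{p_{data}}(\mathbf{x}_t;\mathbf{x}_T|\mathbf{x}_{t-1})
  = \mathrm{KL}\Bigl(
      p_{data}(\mathbf{x}_t,\mathbf{x}_T|\mathbf{x}_{t-1})
      \,\Big\|\,
      p_{data}(\mathbf{x}_t|\mathbf{x}_{t-1})\,
      p_{data}(\mathbf{x}_T|\mathbf{x}_{t-1})
    \Bigr) \;\ge\; 0
```

Under this reading, every summand is non-negative, so $\Delta_{\text{total}}\ge 0$, and a summand vanishes exactly when $\mathbf{x}_t$ and $\mathbf{x}_T$ are conditionally independent given $\mathbf{x}_{t-1}$.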

Figures (19)

  • Figure 1: The pipeline of FDM. We first compress the search space into a small set $\Lambda$ by filtering out candidates with lower local confidence, i.e., $C_{local}$. Finally, we combine local and global confidence to make the final choice at step $t$.
  • Figure 2: The consistency ratio of selecting the next decoding token using $C_{local}$ alone versus both $C_{local}$ and $C_{global}$. Both strategies make their decisions from the same $\mathbf{x}_{t-1}$ at each step. Peaks appear at steps 64 and 128 because we follow the semi-autoregressive pipeline proposed by Nie et al. (2025) with a block size of 64.
  • Figure 3: The effect of different decoding strategies on $\mathbf{x}_T$ at step $t$, given an identical $\mathbf{x}_{t-1}$. The influence gradually decreases as $t$ increases from 0 to $T$.
  • Figure 4: The influence of $K$ on model performance on the GSM8K and Countdown benchmarks.
  • Figure 5: The influence of $\gamma$ on model performance on the GSM8K and Countdown benchmarks.
  • ...and 14 more figures
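
The two-stage selection described in the Figure 1 caption can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `local_candidates` and `foreseeing_step`, the shortlist width `k`, the mixing `weight`, the additive way of combining scores, and the stand-in `global_conf` callable are all hypothetical and need not correspond to the paper's $K$, $\gamma$, or its definition of $C_{global}$.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[int, int]  # (masked position, token id)


def local_candidates(
    probs: Dict[int, List[float]], k: int
) -> List[Tuple[Candidate, float]]:
    """Shortlist the k (position, token) pairs with the highest local confidence.

    probs maps each still-masked position to a toy distribution over tokens.
    Local confidence is taken to be the largest probability at that position.
    """
    scored = []
    for pos, dist in probs.items():
        token = max(range(len(dist)), key=dist.__getitem__)
        scored.append(((pos, token), dist[token]))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def foreseeing_step(
    probs: Dict[int, List[float]],
    global_conf: Callable[[Candidate], float],
    k: int = 2,
    weight: float = 1.0,
) -> Candidate:
    """Pick the shortlisted candidate with the best combined score.

    The combination (local + weight * global) is an assumed, illustrative rule.
    """
    shortlist = local_candidates(probs, k)
    best = max(shortlist, key=lambda item: item[1] + weight * global_conf(item[0]))
    return best[0]


# Toy data: three masked positions over a two-token vocabulary.
probs = {0: [0.9, 0.1], 1: [0.2, 0.8], 2: [0.55, 0.45]}
# Hypothetical global scores; a real system would derive these from the model.
toy_global = {(0, 0): 0.1, (1, 1): 0.6, (2, 0): 0.9}

greedy_choice = foreseeing_step(probs, lambda c: toy_global[c], k=1)
foreseeing_choice = foreseeing_step(probs, lambda c: toy_global[c], k=2)
print(greedy_choice, foreseeing_choice)  # (0, 0) (1, 1)
```

With `k=1` the shortlist holds only the locally most confident candidate, so the step reduces to a purely local heuristic. Widening the shortlist lets the assumed global score overturn that choice, which is the behavior the caption describes.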

Theorems & Definitions (1)

  • Theorem 1