Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

TL;DR

This paper presents the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. It accelerates inference by up to 3.4x on benchmarks covering mathematical reasoning, code generation, multilingual tasks, and creative writing while matching or slightly improving output quality.

Abstract

Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: at the systems level, it converts fragmented KV cache updates into efficient, contiguous appends; at the algorithmic level, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
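
The commit step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the logit margin (top-1 minus top-2 logit) as the stability score, a fixed threshold `tau` where the paper determines the threshold adaptively, a hypothetical delimiter set, and a snapping rule that shrinks the prefix back to the last delimiter inside it. All function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def logit_margin(logits: Sequence[float]) -> float:
    """Gap between the largest and second-largest logit at one position."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def longest_stable_prefix(margins: Sequence[float], tau: float) -> int:
    """Length of the longest left-aligned run whose margins are all >= tau."""
    length = 0
    for margin in margins:
        if margin < tau:
            break
        length += 1
    return length


def snap_to_delimiter(tokens: Sequence[str], length: int,
                      delimiters: Set[str]) -> int:
    """Shrink the prefix so it ends on a delimiter, if one lies inside it.

    Assumption: snapping only moves the boundary left; if the prefix holds
    no delimiter, the length is returned unchanged.
    """
    for end in range(length, 0, -1):
        if tokens[end - 1] in delimiters:
            return end
    return length


def lsp_step(tokens: Sequence[str], logits: Sequence[Sequence[float]],
             tau: float, delimiters: Set[str]) -> Tuple[List[str], List[str]]:
    """Split the active suffix into (committed prefix, remaining suffix)."""
    margins = [logit_margin(row) for row in logits]
    length = longest_stable_prefix(margins, tau)
    length = snap_to_delimiter(tokens, length, delimiters)
    return list(tokens[:length]), list(tokens[length:])


# Toy active suffix: current top-1 tokens and made-up per-position logits.
tokens = ["x", "=", "1", ";", "y", "=", "?"]
logits = [
    [5.0, 1.0, 0.0],   # margin 4.0
    [4.0, 0.5, 0.0],   # margin 3.5
    [3.0, 0.0, -1.0],  # margin 3.0
    [6.0, 1.0, 0.0],   # margin 5.0
    [2.5, 0.0, -1.0],  # margin 2.5
    [1.0, 0.5, 0.0],   # margin 0.5 (unstable, ends the prefix)
    [0.5, 0.25, 0.0],  # margin 0.25
]
committed, suffix = lsp_step(tokens, logits, tau=2.0, delimiters={";"})
print(committed)  # ['x', '=', '1', ';']
print(suffix)     # ['y', '=', '?']
```

In this toy run the first five positions clear the threshold, but the boundary is pulled back to the `;` so the committed block ends on a structural delimiter. Only the suffix would be passed to the next denoising step. The sketch commits nothing when the first position is unstable; how the actual scheduler handles that case is not stated here.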

Paper Structure (35 sections, 3 equations, 2 figures, 4 tables, 1 algorithm)

Figures (2)

  • Figure 1: The iterative process of the Longest Stable Prefix (LSP) scheduler. In each step, LSP performs a single forward pass to assess the stability of predictions for the current active suffix, measured by the logit margin ($\delta_i$). Instead of accepting scattered tokens, it atomically commits the longest contiguous prefix of tokens that meet an adaptively determined stability threshold ($\tau$). As shown, the frozen prefix (green) grows monolithically, causing the active suffix (white) to shrink.
  • Figure 2: Quantifying Repair Costs via Token Flip Rate. We measure the percentage of tokens in the active suffix that change their top prediction between consecutive diffusion steps. While the scattered baseline forces the model to constantly reconcile a fragmented context (maintaining high flip rates), LSP locks in a coherent prefix early. This stabilizes the future generation context, drastically reducing token oscillations and repair costs in the mid-to-late stages (from 14.2% down to 4.3%).
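
The token flip rate from the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the two lists hold the top-1 token ids for the same active-suffix positions at two consecutive steps; how the paper aligns positions as the suffix shrinks is not stated here, and the function name is hypothetical.

```python
from typing import Sequence


def token_flip_rate(prev_top: Sequence[int], curr_top: Sequence[int]) -> float:
    """Percentage of positions whose top-1 prediction changed between steps.

    Assumption: both sequences cover the same positions, in the same order.
    """
    if len(prev_top) != len(curr_top):
        raise ValueError("both steps must cover the same positions")
    if not prev_top:
        return 0.0
    flips = sum(p != c for p, c in zip(prev_top, curr_top))
    return 100.0 * flips / len(prev_top)


# Made-up token ids: positions 1 and 3 change their top prediction.
prev_step = [12, 7, 7, 3]
curr_step = [12, 9, 7, 4]
print(token_flip_rate(prev_step, curr_step))  # 50.0
```

Averaging this quantity over steps and samples would give a curve of the kind the caption describes; the 14.2% and 4.3% figures come from the paper and are not reproduced by this toy example.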