LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models

Chenglin Wang, Yucheng Zhou, Shawn Chen, Tao Wang, Kai Zhang

Abstract

Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the "generation frontier", the regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that LADR achieves an approximately 4× speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.

Paper Structure (34 sections, 3 theorems, 26 equations, 6 figures, 4 tables, 1 algorithm)

Key Result

Proposition 1

Given the spatial inductive bias of the encoder, the mutual information between a token $z_i$ and its immediate neighborhood $\mathcal{N}(i)$ dominates that of distant context $\mathcal{S}_{dist}$. Formally:

Figures (6)

  • Figure 1: Comparison between Standard Parallel Decoding and our LADR method. While standard parallel decoding follows a fixed schedule, LADR accelerates decoding by exploiting spatial locality to dynamically recover neighbor tokens while preserving generation quality.
  • Figure 2: Overview of the LADR method. At each timestep, the flattened discrete tokens are reshaped into a 2D grid to identify candidate neighbors adjacent to resolved regions. These candidates are evaluated using the Confidence Margin (the gap between the top-1 and top-2 confidences) and are dynamically "rescued" (unmasked) based on an adaptive rescue ratio $\alpha$ and threshold $\gamma$. To synchronize the generation timeline with this accelerated accumulation of tokens, the Trajectory Re-alignment module uses an inverse cosine function to recalculate the effective timestep $t_2$, allowing the scheduler to skip redundant iterations while maintaining consistency.
  • Figure 3: Visualization of localized semantic changes caused by perturbing a small set of VQ tokens. The impact remains spatially confined, supporting the locality assumption that nearby tokens dominate information gain.
  • Figure 4: Qualitative comparison of different methods in terms of generation fidelity and inference time. The text prompt is shown below each row, and the inference latency appears in parentheses under each image.
  • Figure 5: Ablation study of spatial selection strategies on the GenEval benchmark. "Random" means Random Selection, "Non-neighbor" represents Non-Neighbor Prioritization, and "Neighbor (Ours)" is our strategy that rescues neighbor tokens.
  • ...and 1 more figure
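
The Figure 2 caption outlines three steps: finding masked tokens next to resolved ones, filtering them by Confidence Margin, and recomputing the timestep with an inverse cosine. The following is a minimal Python sketch of those steps under stated assumptions, not the authors' implementation. It assumes a 4-connected neighborhood, treats $\alpha$ as a cap on the fraction of frontier candidates rescued per step and $\gamma$ as a minimum margin, and assumes a cosine schedule in which the mask ratio at time $t$ is $\cos(\pi t / 2)$. All function names are hypothetical.

```python
import math

# Assumption: 4-connected neighborhood (up, down, left, right).
NEIGHBOR_OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def frontier_indices(resolved, height, width):
    """Flat indices of masked tokens that touch at least one resolved token."""
    found = []
    for i, done in enumerate(resolved):
        if done:
            continue
        r, c = divmod(i, width)  # view the flat sequence as a 2D grid
        for dr, dc in NEIGHBOR_OFFSETS:
            rr, cc = r + dr, c + dc
            if 0 <= rr < height and 0 <= cc < width and resolved[rr * width + cc]:
                found.append(i)
                break
    return found


def confidence_margin(probs):
    """Gap between the largest and second-largest probability."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def rescue(resolved, token_probs, height, width, alpha, gamma):
    """Pick frontier tokens to unmask early.

    Assumption: gamma is a minimum margin, and alpha caps the number of
    rescued tokens at ceil(alpha * number of frontier candidates).
    """
    candidates = frontier_indices(resolved, height, width)
    scored = [(confidence_margin(token_probs[i]), i) for i in candidates]
    confident = sorted((s for s in scored if s[0] >= gamma), reverse=True)
    budget = math.ceil(alpha * len(candidates))
    return [i for _, i in confident[:budget]]


def realigned_timestep(num_masked, num_tokens):
    """Invert the assumed schedule mask_ratio(t) = cos(pi * t / 2)."""
    return math.acos(num_masked / num_tokens) / (math.pi / 2)


# Usage: a 3x3 grid where only the center token (flat index 4) is resolved.
resolved = [False] * 9
resolved[4] = True
token_probs = [[0.4, 0.3, 0.3] for _ in range(9)]
token_probs[1] = [0.90, 0.05, 0.05]  # margin 0.85
token_probs[3] = [0.50, 0.40, 0.10]  # margin about 0.10, below gamma
token_probs[5] = [0.60, 0.30, 0.10]  # margin 0.30
token_probs[7] = [0.70, 0.20, 0.10]  # margin 0.50

rescued = rescue(resolved, token_probs, 3, 3, alpha=0.5, gamma=0.2)
print(rescued)  # [1, 7]

for i in rescued:
    resolved[i] = True
print(realigned_timestep(resolved.count(False), len(resolved)))
```

In this toy run the frontier is the four tokens around the center. Three of them clear the margin threshold, and the budget of two keeps the two most confident. The timestep is then recomputed from the remaining masked fraction, so a scheduler built on the assumed cosine schedule would resume from the point matching the actual mask density.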

Theorems & Definitions (7)

  • Definition 1: Generation Frontier
  • Proposition 1: Locality-Driven Information Lower Bound
  • Theorem 1: Margin-based Error Bound
  • Proposition 2: Manifold Consistency Condition
  • proof
  • proof
  • proof