MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee

TL;DR

This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising.

Abstract

Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
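The core idea in the abstract (one exact attention pass on the All-[MASK] block, then reuse of the selected KV entries for later denoising steps) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the max-over-queries aggregation used to score each KV entry, and the fixed budget `k` are assumptions made for illustration only; the abstract does not specify them.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Exact scaled dot-product attention weights: one row per query, one column per key."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

def select_topk_indices(weights, k):
    """Pick the k KV entries with the highest score.

    Assumption: a KV entry's score is its maximum attention weight over
    all queries in the block. Other aggregations (sum, mean) are possible.
    """
    n_keys = len(weights[0])
    scores = [max(row[j] for row in weights) for j in range(n_keys)]
    ranked = sorted(range(n_keys), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(queries, keys, values, indices):
    """Attention restricted to the KV entries listed in `indices`."""
    sub_keys = [keys[j] for j in indices]
    sub_values = [values[j] for j in indices]
    weights = attention_weights(queries, sub_keys)
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, sub_values)) for d in range(dim)]
        for row in weights
    ]

# Toy cached context: 4 KV entries, head dimension 2 (hypothetical values).
keys = [[5.0, 0.0], [0.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Step 0: queries of the All-[MASK] block run one exact attention pass.
mask_queries = [[1.0, 0.0], [0.0, 1.0]]
selected = select_topk_indices(attention_weights(mask_queries, keys), k=2)

# Later denoising steps: reuse `selected` instead of re-scoring the full cache.
later_queries = [[0.9, 0.1], [0.1, 0.9]]
output = sparse_attention(later_queries, keys, values, selected)
```

In this sketch the full KV cache is read once per block, and every subsequent denoising step touches only the `k` selected entries, which is where the memory-access savings would come from.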

Paper Structure (38 sections, 4 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: Top-K recall rate across denoising steps on LongBench tasks (0: All-[MASK] block, 1: fully decoded block). The top-K indices selected from the All-[MASK] block maintain 84-90% recall throughout denoising.
  • Figure 2: Layer-wise budget across denoising progress on four LongBench tasks. Color indicates normalized budget (darker = higher). Early layers require larger budgets, while horizontal bands indicate stable relative budgets throughout denoising.
  • Figure 3: Overview of the MAGE Fine-tuning process. The process consists of three stages: (1) Index Selection identifies important KV indices via Top-K selection without gradients; (2) Sparse Forward applies the sparse mask and computes gradients; (3) Teacher Forward provides exact reference logits for self-distillation.
  • Figure 4: Per-task accuracy on LongBench with Fast-dLLM 1.5B across three denoising configurations (1, 2, 4 tokens/step). MAGE and MAGE-FT consistently outperform Quest and Tidal across all tasks and budgets. MAGE-FT often surpasses exact attention at moderate budgets, while maintaining robustness at higher tokens per step.
  • Figure 5: Per-task average denoising step latency on LongBench with Fast-dLLM 1.5B across three denoising configurations (1, 2, 4 tokens/step).
  • ...and 7 more figures
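The recall rate reported in Figure 1 can be read as a set-overlap measure. The sketch below assumes recall is the fraction of a later step's exact top-K indices that were already present in the All-[MASK] selection; the caption does not give the formula, so this definition and the function name `topk_recall` are assumptions.

```python
def topk_recall(mask_step_indices, current_step_indices):
    """Fraction of the current step's top-K indices covered by the All-[MASK] selection.

    Assumed definition: |mask_set & current_set| / |current_set|.
    """
    mask_set = set(mask_step_indices)
    current_set = set(current_step_indices)
    if not current_set:
        return 0.0
    return len(mask_set & current_set) / len(current_set)

# Hypothetical indices: three of the four exact top-K entries were pre-selected.
recall = topk_recall([0, 2, 5, 7], [0, 2, 5, 9])
```

Under this reading, a recall of 0.84 to 0.90 means that most of the KV entries an exact top-K would choose at any denoising step were already chosen at the first step.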