
Sink-Aware Pruning for Diffusion Language Models

Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen

TL;DR

This work proposes Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). The method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute.

Abstract

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.

Paper Structure

This paper contains 27 sections, 14 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Illustration of attention sink behaviors in Diffusion Language Models. Our Sink-Aware Pruning reduces sink variance by downscaling unstable sinks.
  • Figure 2: Attention sink heatmap dynamics across generation steps for an AR LLM (LLaMA-3-8B) and a DLM (LLaDA). For each model, we show 3 different generation stages (25%, 50%, and 75% of the total generation steps) and plot the attention mass received by each token position (y-axis) across all heads/layers (x-axis). In LLaMA, the sink position (deep-blue vertical band) is stable across steps, while in LLaDA, the sink position shifts significantly across diffusion steps, indicating higher sink variance. For the AR model, "step" refers to a step of the generation process.
  • Figure 3: Overview of Sink-Aware Pruning. Given input activations, we compute per-token attention mass aggregated across all layers and heads (Step 1), identify sink tokens via a threshold-based criterion, and derive a soft down-weighting factor $\omega = 1-s$. The original activation $X$ is then suppressed at sink positions to produce a new activation $\tilde{X}$ (Step 2), which is substituted into existing pruning criteria, Wanda or SparseGPT, to compute sink-aware importance scores (Step 3). Final pruning decisions are made based on the updated scores (Step 4).
  • Figure 4: Attention sink variance for diffusion LMs (LLaDA, Dream) and autoregressive LMs (Llama 3.1, Qwen 2.5).
  • Figure 5: Sink position across generation/denoising steps for diffusion and AR LMs. Shaded regions denote $\pm$ std across runs.
  • ...and 3 more figures
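
The four steps in the Figure 3 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not give the aggregation rule, the threshold, or the definition of the sink score $s$, so the mean aggregation, the `threshold_factor` parameter, and the choice `s = mass / mass.max()` are all hypothetical. The Wanda score $|W_{ij}| \cdot \lVert X_j \rVert_2$ is the standard one, with $X$ replaced by the suppressed activation $\tilde{X}$. All function names are illustrative.

```python
import numpy as np

def token_attention_mass(attn):
    """Step 1: attention mass received by each key token.
    attn has shape (layers, heads, queries, keys) with rows summing to 1.
    Assumption: aggregate by averaging over layers, heads, and queries."""
    return attn.mean(axis=(0, 1, 2))

def sink_weights(mass, threshold_factor=4.0):
    """Threshold-based sink detection and soft factor omega = 1 - s.
    Assumption: a token is a sink if its mass exceeds threshold_factor
    times the uniform mass 1/T; its score s is mass / mass.max()."""
    num_tokens = mass.shape[0]
    is_sink = mass > threshold_factor / num_tokens
    s = np.where(is_sink, mass / mass.max(), 0.0)
    return 1.0 - s

def sink_aware_wanda_scores(W, X, omega):
    """Steps 2-3: suppress sink rows of X, then compute Wanda scores.
    W: (out_features, in_features), X: (tokens, in_features),
    omega: (tokens,)."""
    X_tilde = X * omega[:, None]                 # down-weight sink tokens
    feature_norm = np.linalg.norm(X_tilde, axis=0)
    return np.abs(W) * feature_norm[None, :]

def prune_rows(W, scores, sparsity=0.5):
    """Step 4: zero the lowest-scoring weights within each output row."""
    k = int(W.shape[1] * sparsity)
    lowest = np.argsort(scores, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, lowest, 0.0, axis=1)
    return W_pruned

# Toy usage: 4 tokens, token 0 receives most of the attention.
rng = np.random.default_rng(0)
attn = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (2, 2, 4, 1))
omega = sink_weights(token_attention_mass(attn), threshold_factor=2.0)
W = rng.normal(size=(2, 6))
X = rng.normal(size=(4, 6))
W_pruned = prune_rows(W, sink_aware_wanda_scores(W, X, omega), sparsity=0.5)
```

With this choice of $s$, the strongest sink is removed from the calibration statistics entirely ($\omega = 0$) and weaker sinks are only partially suppressed. A SparseGPT variant would substitute $\tilde{X}$ into its Hessian estimate in the same way, which is not shown here.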