DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou

TL;DR

The paper presents DLM-Scope, the first SAE-based interpretability framework for DLMs. It demonstrates that trained Top-K SAEs can faithfully extract interpretable features, and it finds that inserting SAEs affects DLMs differently than autoregressive LLMs.

Abstract

Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. As diffusion language models (DLMs) have recently become an increasingly promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order and that SAE features remain stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows great potential for applying SAEs to DLM-related tasks and algorithms.

Paper Structure (44 sections, 18 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: DLM-Scope pipeline. Top (orange): DLM-SAE training and validation. Left: Top-$K$ SAEs are trained on Dream/LLaDA. Middle: They are evaluated via sparsity-fidelity. Right: They are auto-interpreted by generating explanations and interpretability scores. Bottom (green): The value of DLM-SAEs. Left: Feature steering is applied across denoising steps. Middle: Different decoding orders are analyzed by tracking Top-$K$ feature dynamics. Right: Cross-training transfer is tested by applying base-trained SAEs to DLM-SFT.
  • Figure 2: DLM-SAE overview. Left: Training DLM-SAEs. We collect residual-stream activations from one-step denoising inputs and train SAEs using two strategies: Mask-SAE or Unmask-SAE. Right: Diffusion-time feature steering. We select feature $f$ and inject its decoder direction into the residual stream at every denoising step, either on all positions or only on update positions.
  • Figure 3: Sparsity-fidelity trade-off for Qwen SAEs and Dream SAEs. Top row: functional fidelity measured by $\Delta$LM loss (Eq. \ref{eq:delta_loss}); Bottom row: reconstruction fidelity measured by explained variance (Eq. \ref{eq:ev}). Columns: an LLM baseline (Qwen-2.5-7B, left) versus Dream-7B SAEs (Mask-SAE, middle; Unmask-SAE, right). This figure shows that Dream-SAEs achieve strong sparsity-fidelity trade-offs and even exhibit negative $\Delta$LM loss in shallow layers at small $L_0$, an effect absent or much weaker in the LLM baseline.
  • Figure 4: Pre-mask SAE feature stability across three DLM inference orders. Top: layer-step heatmaps of mean pre-mask top-$k$ Jaccard similarity between consecutive steps ($k\!-\!1\!\rightarrow\!k$). Bottom: mean pre-mask similarity (averaged over tracked layers/positions) vs. normalized generation progress. This figure shows that Origin yields a less dynamic SAE trajectory, while confidence-based orders exhibit structured turnover followed by stabilization.
  • Figure 5: Post-decode SAE feature drift. Drift is computed only after a position's token is fixed. Top: layer-step heatmaps of post-decode drift between the top-$k$ feature sets of consecutive steps. Bottom: mean post-decode drift (averaged over tracked positions) vs. normalized generation progress. This figure shows that confidence-based orders sustain stronger deep-layer drift, suggesting that bidirectional attention has a stronger effect under these orders, whereas Origin drifts less.
  • ...and 7 more figures
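
The Figure 4 caption refers to the Jaccard similarity between the top-$k$ SAE feature sets of consecutive denoising steps. A minimal sketch of that computation for a single position is given below. It assumes the SAE activations at each step are available as plain lists of floats. The function names (`top_k_features`, `jaccard`, `consecutive_similarity`) are illustrative and are not taken from the paper's code, and the paper's exact averaging over layers and positions is not reproduced here.

```python
from typing import List, Sequence, Set


def top_k_features(activations: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k largest SAE feature activations."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i],
                   reverse=True)
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b| (1.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def consecutive_similarity(steps: Sequence[Sequence[float]], k: int) -> List[float]:
    """Similarity between the top-k feature sets of each pair of consecutive steps."""
    sets = [top_k_features(s, k) for s in steps]
    return [jaccard(prev, cur) for prev, cur in zip(sets, sets[1:])]


# Toy example (made-up activations): three denoising steps, four SAE features.
steps = [
    [0.9, 0.1, 0.8, 0.0],  # top-2 features: {0, 2}
    [0.9, 0.7, 0.1, 0.0],  # top-2 features: {0, 1}
    [0.9, 0.7, 0.1, 0.0],  # top-2 features: {0, 1}
]
sims = consecutive_similarity(steps, k=2)
```

In this toy example the first transition shares one of three distinct features and the second transition leaves the top-2 set unchanged. Averaging such values over tracked layers and positions would give curves of the kind the caption describes.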