MetaState: Persistent Working Memory for Discrete Diffusion Language Models

Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Qirui Jin, Wenke Lee

TL;DR

Persistent cross-step memory bridges denoising steps and improves generation quality in discrete diffusion language models, while adding only a negligible number of trainable parameters and keeping the backbone frozen.

Abstract

Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, and the intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory whose size is independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with $K$-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState adds only a negligible number of trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over the frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
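
The recurrent loop described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation, and several things in it are assumptions made only for illustration: the attention has no learned projections, the gate in `gru_update` is parameter-free, `frozen_backbone` is a stand-in function, and the order of the three modules within a step is a guess. All function names are hypothetical. The point it shows is that activations are rebuilt from scratch at every denoising step, while a fixed number of memory slots is the only state carried across steps.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head scaled dot-product attention; `context` acts as keys and values.
    Learned projections are omitted (an assumption of this sketch)."""
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return out

def gru_update(memory, read):
    """GRU-style update: a gate blends the old memory with a candidate state.
    The parameter-free gate used here is hypothetical."""
    new_memory = []
    for m_row, r_row in zip(memory, read):
        row = []
        for m, r in zip(m_row, r_row):
            z = 1.0 / (1.0 + math.exp(-(m + r)))  # update gate in (0, 1)
            candidate = math.tanh(r)              # candidate state from the new read
            row.append((1.0 - z) * m + z * candidate)
        new_memory.append(row)
    return new_memory

def frozen_backbone(hidden):
    """Stand-in for the frozen dLLM backbone: a fixed map with no trainable state."""
    return [[math.tanh(x) for x in row] for row in hidden]

def metastate_step(hidden, memory):
    # Injector: each position attends over the memory slots (residual add).
    injected = cross_attention(hidden, memory)
    hidden = [[h + i for h, i in zip(h_row, i_row)] for h_row, i_row in zip(hidden, injected)]
    hidden = frozen_backbone(hidden)
    # Mixer: each memory slot attends over the backbone activations.
    read = cross_attention(memory, hidden)
    # Updater: integrate the read into the persistent memory.
    return hidden, gru_update(memory, read)

def run_denoising(num_steps, seq_len, num_slots, d, seed=0):
    rng = random.Random(seed)
    memory = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(num_slots)]
    for _ in range(num_steps):
        # Activations are recomputed from the re-masked tokens at each step
        # (simulated with fresh random vectors); only `memory` persists.
        hidden = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(seq_len)]
        hidden, memory = metastate_step(hidden, memory)
    return memory

if __name__ == "__main__":
    short = run_denoising(num_steps=3, seq_len=6, num_slots=2, d=4)
    long = run_denoising(num_steps=3, seq_len=50, num_slots=2, d=4)
    print(len(short), len(short[0]), len(long), len(long[0]))
```

Running the sketch with sequence lengths of 6 and 50 yields memories of the same shape, which mirrors the claim that the working memory is fixed-size and independent of sequence length. In a trained system the three modules would carry learned weights, and the loop would be unrolled for $K$ steps during fine-tuning so that gradients flow through the memory updates.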

Paper Structure

This paper contains 33 sections, 21 equations, 3 figures, 1 table, and 1 algorithm.

Figures (3)

  • Figure 1: The Information Island problem in discrete diffusion: sampling and remasking compress continuous hidden activations into discrete tokens, creating a lossy bottleneck between denoising steps. MetaState mitigates this by maintaining a persistent state across steps.
  • Figure 2: Performance of MetaState-augmented models compared with frozen baselines on reasoning and coding benchmarks for LLaDA-8B and Dream-7B.
  • Figure 3: Overview of the MetaState architecture. The three modules (Injector, Mixer, Updater) and shared time conditioner form a recurrent loop around the frozen backbone, propagating a persistent state across denoising steps.