Discrete Stochastic Localization for Non-autoregressive Generation

Yunshu Wu; Jiayi Cheng; Partha Thakuria; Rob Brekelmans; Evangelos E. Papalexakis; Greg Ver Steeg

TL;DR

This work proposes DSL (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer.

Abstract

Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work, we show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose DSL (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with $\sim 4\times$ fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
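To make "a continuum of corruption levels" concrete, the following is a minimal sketch of per-token mixed corruption, not the authors' implementation. It assumes a Gaussian channel $z = \sqrt{\gamma}\,x + \epsilon$ on token embeddings, with smoothed endpoint ranges borrowed from the Figure 4 caption ($\mathrm{Unif}(0,\gamma_{\min})$ and $\mathrm{Unif}(c\gamma_{\max},\gamma_{\max})$). The function names, the mixture weights `p_low` and `p_high`, the log-uniform intermediate range, and all default values are hypothetical.

```python
import math
import random

def sample_snr(rng, gamma_min=0.1, gamma_max=100.0, c=0.5, p_low=0.25, p_high=0.25):
    """Draw one per-token SNR level gamma. All defaults are hypothetical."""
    u = rng.random()
    if u < p_low:
        # Smoothed mask-like endpoint: almost no signal survives.
        return rng.uniform(0.0, gamma_min)
    if u < p_low + p_high:
        # Smoothed near-clean endpoint.
        return rng.uniform(c * gamma_max, gamma_max)
    # Intermediate "draft noise": log-uniform between the two endpoint ranges
    # (an assumed choice; the source does not specify this distribution).
    lo, hi = math.log(gamma_min), math.log(c * gamma_max)
    return math.exp(rng.uniform(lo, hi))

def corrupt_embedding(x, gamma, rng):
    """Assumed Gaussian channel: z = sqrt(gamma) * x + eps, eps ~ N(0, I)."""
    scale = math.sqrt(gamma)
    return [scale * xi + rng.gauss(0.0, 1.0) for xi in x]

def corrupt_sequence(embeddings, rng):
    """Corrupt each token embedding at its own independently sampled SNR."""
    gammas = [sample_snr(rng) for _ in embeddings]
    noisy = [corrupt_embedding(x, g, rng) for x, g in zip(embeddings, gammas)]
    return noisy, gammas
```

Because each token draws its own $\gamma$, a single training sequence can mix mask-like, draft-like, and near-clean positions, which is the situation a remasking sampler produces at refinement time.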

Paper Structure (52 sections, 2 theorems, 49 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Background
Discrete Stochastic Localization
DSL Training and Architecture
DSL objective
Softmax converter for stable attention under noisy embeddings
Time-invariant DiT-style denoiser
Sampling with MDLM and ReMDM
Mask diffusion samplers
A reproducible protocol for choosing ReMDM hyperparameters under a step budget
Experiments
Setup
Results
Mechanistic Analysis
Toy case study: remasking must be targeted
...and 37 more sections

Key Result

Theorem 1.1

Consider the signal model where ${\bm{x}}$ is arbitrarily distributed with finite second-order moments, and ${\bm \epsilon}_i \sim \mathcal{N}(0, \mathbf{I})$ is independent Gaussian noise. Then the gradient of the Kullback–Leibler divergence between the conditional output distribution $p_{\bm \gamma}(\bm z|{\bm{x}})$ and th where $E_i({\bm{x}}, {\bm \gamma}) \equiv \mathbb{E}_{p_{\bm \gamma}(\bm

Figures (6)

Figure 1: Discrete Stochastic Localization (DSL). A single SNR-invariant denoiser supports arbitrary per-token SNR paths, including remasking-induced "backtracking", motivating mixed-corruption training to better match refinement-time drafts and improve step-efficiency.
Figure 2: Sampling diagnostics under a fixed step budget. (a) Masking and reveal schedule. (b) Remasking intensity and realized rewrites per token. (c) Posterior sharpening measured by mean max-probability and top-$p$ nucleus size.
Figure 3: ReMDM-style discrete correction on a cyclic toy. (a) We corrupt the ground-truth sequence ABCDEFG by masking two positions and inserting two visible-but-wrong tokens, yielding __CDBFF. (b) Confidence-driven remasking enables subsequent steps to rewrite low-confidence visible tokens; the refinement trajectory corrects both masked and wrong tokens and can recover ABCDEFG within 10 refinement steps in this example.
Figure 4: Endpoint smoothing improves near-clean calibration. We compare atomic ROAR endpoints ($\gamma\!\in\!\{0,\gamma_{\max}\}$) to smoothed endpoint ranges ($\gamma\!\sim\!\mathrm{Unif}(0,\gamma_{\min})$ or $\gamma\!\sim\!\mathrm{Unif}(c\gamma_{\max},\gamma_{\max})$). On held-out data, we measure calibration under teacher forcing on corrupted inputs; smoothing reduces ECE at large SNR and yields reliability closer to the diagonal at $\mathrm{SNR}{=}100$.
Figure 5: Endpoint smoothing improves the step-quality trade-off under fixed decoding. Using the same ReMDM-style sampler with principled $\eta$-cap (eta_cap on; no uncertainty-guided remasking) and identical schedules, smoothed-endpoint checkpoints achieve higher MAUVE across step budgets (and comparable/better GenPPL), while atomic endpoints show weak gains as $T$ increases.
...and 1 more figure
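The captions for Figure 2(c) and Figure 4 name three diagnostics: mean max-probability, top-$p$ nucleus size, and expected calibration error (ECE). The sketch below shows how these quantities are commonly computed from per-token predictive distributions. It is an illustration under assumptions rather than the paper's evaluation code: the equal-width binning, the default of 10 bins, the `top_p=0.9` threshold, and all function names are hypothetical choices.

```python
def mean_max_prob(distributions):
    """Average of the largest probability in each per-token distribution."""
    return sum(max(p) for p in distributions) / len(distributions)

def nucleus_size(p, top_p=0.9):
    """Smallest number of highest-probability tokens whose mass reaches top_p."""
    total = 0.0
    for k, q in enumerate(sorted(p, reverse=True), start=1):
        total += q
        # Small tolerance guards against floating-point rounding in the sum.
        if total >= top_p - 1e-12:
            return k
    return len(p)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; confidence 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        accuracy = sum(ok for _, ok in b) / len(b)
        avg_conf = sum(conf for conf, _ in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

Here `confidences` would be the max-probability at each position and `correct` a 0/1 flag for whether the argmax matches the held-out token. A sharper posterior shows up as a higher mean max-probability and a smaller nucleus, while a lower ECE means the reported confidence tracks the actual accuracy more closely.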

Theorems & Definitions (2)

Theorem 1.1: palomar2005gradient, Thm. 5
Lemma 1.2
