Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs

Linrui Ma; Yufei Cui; Kai Han; Yunhe Wang

Abstract

Discrete diffusion models offer global context awareness and flexible parallel generation. However, the uniform random noise schedulers used in standard DLLM training overlook the highly non-uniform information density inherent in real-world sequences. This wastes optimization resources on low-density structural glue while leaving high-density logical pivot points severely under-optimized. To address this, we propose an Information Density Driven Smart Noise Scheduler. By extracting information-dense hubs and applying Complementary Priority Masking, our method decouples a single training instance into mutually reinforcing reasoning and syntax samples, forcing the model to master both logical deduction and foundational sequence structure. Experiments demonstrate that our approach improves average accuracy by ~4% across four Code and Math reasoning benchmarks, significantly outperforming uniform baselines. Mechanistic analyses further reveal that probabilistic priority masking effectively mitigates contextual collapse during block diffusion training. Overall, this density-aware strategy efficiently unlocks the reasoning potential of diffusion language models at minimal annotation cost, emerging as a promising new masked data training paradigm for Diffusion LLMs. Our processed dataset can be found at https://huggingface.co/datasets/malr07/opc-sft-stage2-dense-extracted.

Paper Structure (18 sections, 3 equations, 4 figures, 2 tables)

Introduction
Methodology
Information-Dense Region Extraction
Complementary Priority Noise Scheduling
Limitations of traditional uniform scheduling
Priority Masking
Complementary Masking & Decoupling
Experiments
Experimental Setup
Model and Training Details
Dataset Preparation
Supervised Fine-Tuning Results
Ablation Studies
Impact of Bias Weight $w$
Soft vs. Hard Priority Masking
...and 3 more sections

Figures (4)

Figure 1: An illustrative comparison of random masking versus our density-driven masking
Figure 2: Overview of the Information Density Driven Noise Scheduler pipeline for one sample. Left: the sample first undergoes LLM-based dense-region identification. Right: the sample is then duplicated; the Logical Sample is created by preferentially masking dense regions given noise ratio $\sigma_t$ and bias weight $w$, while the Syntactic Sample is created by taking the logical inverse of the Logical Sample's mask.
Figure 3: Impact of different bias weights ($w$) on fixed preprocessed data (Code 10% + Math 50%)
Figure 4: Performance of models trained without complementary masking (priority masking only). The gray-shaded plot area indicates baseline performance with bias weight $w$ set to 1. The red dashed line indicates the results with complementary samples under the same settings.
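
The masking step described in the Figure 2 caption can be illustrated with a minimal sketch. It rests on several assumptions that this page does not confirm: that dense regions are given as a per-token boolean list, that the bias weight $w$ acts as a relative sampling weight for dense tokens (so $w = 1$ falls back to uniform masking, in line with the baseline noted for Figure 4), and that exactly `round(sigma_t * n)` tokens are masked. The function name `complementary_priority_masks` and the weighted-sampling rule are hypothetical and may differ from the paper's actual scheduler.

```python
import random


def complementary_priority_masks(dense, sigma_t, w, seed=None):
    """Return (logical_mask, syntactic_mask) for one token sequence.

    dense   : list of bool, True where a token lies in a dense region
    sigma_t : noise ratio in [0, 1]; fraction of tokens masked in the
              logical sample (assumed interpretation)
    w       : bias weight > 0 for dense tokens; w == 1 is uniform masking
    """
    rng = random.Random(seed)
    n = len(dense)
    k = round(sigma_t * n)  # number of tokens to mask in the logical sample

    # Assumed rule: dense tokens get sampling weight w, all others weight 1.
    weights = [float(w) if is_dense else 1.0 for is_dense in dense]

    # Efraimidis-Spirakis weighted sampling without replacement:
    # draw u ~ U(0, 1) per token and keep the k largest keys u ** (1 / weight).
    keys = [rng.random() ** (1.0 / wt) for wt in weights]
    chosen = sorted(range(n), key=lambda i: keys[i], reverse=True)[:k]

    logical = [False] * n
    for i in chosen:
        logical[i] = True

    # The syntactic sample masks exactly the tokens the logical sample keeps.
    syntactic = [not m for m in logical]
    return logical, syntactic


# Hypothetical 8-token sequence whose tokens 2..4 form a dense region.
dense = [False, False, True, True, True, False, False, False]
logical, syntactic = complementary_priority_masks(dense, sigma_t=0.5, w=4.0, seed=0)
```

Because the second mask is the logical inverse of the first, every token is masked in exactly one of the two samples, so the pair jointly covers the whole sequence regardless of how $w$ skews the first draw.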
