dLLM: Simple Diffusion Language Modeling

Zhanhui Zhou; Lingjie Chen; Hanghang Tong; Dawn Song

TL;DR

dLLM is an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs.

Abstract

Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.

Paper Structure (34 sections, 3 equations, 8 figures, 7 tables)

Introduction
Preliminaries
Discrete Diffusion.
Masked Diffusion (MDLM).
Block Diffusion (BD3LM).
dLLM Overview
Trainer
Unified training interface with Trainer (Figure 1).
Modular design enables easy customization (Figure 1).
Simple yet scalable training powered by HF infrastructure.
Sampler
Unified inference interface with Sampler (Figure 2).
Terminal visualizer (Figure 3).
Efficient DLM inference (Figures 2 & 4).
Evaluation
...and 19 more sections

Figures (8)

Figure 1: A unified trainer interface supports a variety of purposes via modular trainers and configuration changes. One panel shows the MDLM pretraining setup. Another shows the single-line trainer swap from MDLMTrainer to BD3LMTrainer. A third shows the minimal changes to use MDLMTrainer for SFT: NoAttentionMaskWrapper keeps padding EOS visible, and label_pad_token_id=eos_token_id trains the model to generate EOS from extra mask tokens in inputs. A fourth shows the minimal changes to adapt an autoregressive LM to MDLM: right_shift_logits reuses next-token prediction, and PrependBOSWrapper prepends BOS to provide the predictions for the first mask token.
Figure 2: Inference pipeline: sampler swap from the vanilla MDLM sampler to the Fast-dLLM MDLM sampler.
Figure 3: Terminal visualizer showing the transition from masked to decoded tokens.
Figure 4: Fast-dLLM evaluation results with max new tokens set to 256 and 512. Model selection follows the original Fast-dLLM evaluation for consistency and fair comparison. Cache uses block-wise approximate KV caching within each decoding block; Parallel uses confidence-based parallel token updates; Cache & Parallel combines both. Note that max new tokens determines the number of pre-allocated padding tokens in the bidirectional context window and therefore affects compute and measured performance.
Figure 5: Sensitivity to decoding hyperparameters. We vary individual sampling hyperparameters at inference time and observe that performance can degrade sharply from the optimal configuration. Baseline denotes the best-performing setting; Suppress does not suppress <eos> from the beginning of generation; CFG sets cfg=0.5; Parallel @ 4 generates four tokens per step; and Temp @ 0 sets temperature=0.0.
...and 3 more figures
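
The Figure 1 caption says that adapting an autoregressive LM to MDLM relies on right-shifting logits and prepending BOS. The alignment behind that idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not dLLM's implementation: the helper names `prepend_bos`, `right_shift`, and `toy_next_token_predictions`, the BOS id of 0, and the toy "model" are all hypothetical stand-ins. An autoregressive LM's output at position j is a prediction for the token at position j + 1, so shifting the outputs right by one makes index i hold the prediction for the token at position i. The first position then has no prediction unless a BOS token is placed in front of the sequence.

```python
from typing import List, Sequence

BOS = 0  # hypothetical BOS token id, chosen only for this sketch


def prepend_bos(token_ids: Sequence[int], bos_id: int = BOS) -> List[int]:
    """Put a BOS token in front so the first real token gets a prediction."""
    return [bos_id, *token_ids]


def right_shift(per_position: Sequence, fill=None) -> list:
    """Move the entry at index j to index j + 1; index 0 receives `fill`."""
    return [fill, *per_position[:-1]]


def toy_next_token_predictions(input_ids: Sequence[int]) -> List[int]:
    """Toy stand-in for an autoregressive LM (an assumption, not a real model).

    At position j it 'predicts' the next token as input_ids[j] + 1, which is
    correct for the counting sequence 1, 2, 3, ... used below.
    """
    return [t + 1 for t in input_ids]


tokens = [1, 2, 3, 4]

# With BOS: after the shift, index i holds the prediction for position i.
with_bos = prepend_bos(tokens)
aligned = right_shift(toy_next_token_predictions(with_bos))
print(with_bos)  # input positions
print(aligned)   # prediction aligned to each input position

# Without BOS: the first real token has no prediction after the shift.
no_bos_aligned = right_shift(toy_next_token_predictions(tokens))
print(no_bos_aligned)
```

In the BOS-prepended case every real token position has an aligned prediction, while the BOS slot itself is left empty. Without BOS, the empty slot falls on the first real token, which matches the caption's point that BOS provides the prediction for the first mask token.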
