LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang

TL;DR

This work tackles the scalability gap between auto-regressive (AR) LLMs and diffusion-based models by converting pre-trained AR checkpoints into discrete diffusion LLMs (dLLMs) of up to 100B parameters. It introduces a Warmup-Stable-Decay (WSD) training scheme to adapt AR models to block diffusion, along with a document-level attention mask and a top-k checkpoint merge to improve stability and efficiency. A post-training pipeline of SFT, Confidence-Aware Parallel (CAP) training, and DPO aligns the model to instructions and human preferences while maintaining fast parallel decoding. The resulting models, LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), show competitive or superior performance on a broad benchmark suite. The results demonstrate promising capabilities in reasoning, coding, and agent-like tool use, highlighting diffusion models as a viable frontier for scalable, deployable LLMs.

Abstract

This paper presents LLaDA2.0, a pair of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models, establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase, block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.
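
The three-phase schedule described in the abstract can be pictured as a lookup from training step to block size. The following is a minimal sketch, not the authors' implementation: the `Phase` class, the `block_size_at` helper, the step counts, the sequence length, and the specific block sizes are all hypothetical values chosen for illustration, since the abstract only states the shape of the schedule (growing blocks, then full sequence, then a compact block).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    steps: int           # number of training steps in this phase (hypothetical)
    block_sizes: tuple   # block sizes visited in order, spread evenly over the phase


def block_size_at(step, phases):
    """Return (phase name, block size) for a global training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for phase in phases:
        end = start + phase.steps
        if step < end:
            # Split the phase evenly among its block sizes.
            idx = (step - start) * len(phase.block_sizes) // phase.steps
            return phase.name, phase.block_sizes[idx]
        start = end
    raise ValueError("step is past the end of the schedule")


SEQ_LEN = 4096  # hypothetical training sequence length

# Hypothetical numbers; only the warm-up / stable / decay shape comes from the abstract.
SCHEDULE = (
    Phase("warmup", 300, (4, 32, 64)),   # progressively larger blocks
    Phase("stable", 600, (SEQ_LEN,)),    # one block spans the whole sequence
    Phase("decay", 100, (32,)),          # back to a compact block size
)

if __name__ == "__main__":
    for step in (0, 100, 200, 300, 900):
        print(step, block_size_at(step, SCHEDULE))
```

In this sketch the stable phase is expressed as block diffusion whose single block equals the sequence length, which is one way to view full-sequence diffusion as the limiting case of the warm-up.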

Paper Structure

This paper contains 36 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: LLaDA2.0-flash main results.
  • Figure 2: A schematic of the progressive training framework for transforming an AR model into an MDLM. The continual pre-training stage applies the Warmup-Stable-Decay strategy, scheduling the block size $L_{B}$ to enable smooth, stable, and effective attention mask adaptation. The post-training stage uses the same block diffusion configuration for instruction SFT, Confidence-Aware Parallel SFT, and DPO. The right panel illustrates the document-level block diffusion attention mask, which enables an efficient, vectorized forward pass by constructing a single input sequence from multiple noisy and clean examples, such as $[\bm{x}_{\text{noisy1}},\dots,\bm{x}_{\text{clean1}},\dots]$. The forward pass then employs a combination of block-diagonal ($\mathbf{M}_{\text{BD}}$), offset block-causal ($\mathbf{M}_{\text{OBC}}$), and block-causal ($\mathbf{M}_{\text{BC}}$) masks.
  • Figure 3: Average score and tokens‑per‑forward (TPF) for LLaDA2.0‑flash with and without CAP across 12 benchmarks. Inference speed (tokens per second) of LLaDA2.0‑flash compared with similarly sized AR models on 4 code and math benchmarks.
  • Figure 4: Score/TPF vs threshold/block size
  • Figure 5: Performance on the RULER benchmark.
  • ...and 1 more figure
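
The mask combination named in the Figure 2 caption can be sketched in a few lines. This is an illustrative reading, not the authors' code: it assumes the usual block diffusion layout in which the input is the noisy sequence followed by its clean copy, noisy tokens attend to their own noisy block ($\mathbf{M}_{\text{BD}}$) and to clean tokens of strictly earlier blocks ($\mathbf{M}_{\text{OBC}}$), and clean tokens attend to clean tokens of the same or earlier blocks ($\mathbf{M}_{\text{BC}}$). It also assumes that "document-level" means attention never crosses document boundaries in a packed sequence. The function name `block_diffusion_mask` and the `doc_ids` argument are hypothetical.

```python
def block_diffusion_mask(doc_ids, block_size):
    """Build a 2L x 2L boolean mask for the sequence [x_noisy ; x_clean].

    doc_ids[i] is the document index of token i in a packed sequence of
    length L; tokens of one document are assumed to be contiguous.
    Blocks are counted from the start of each document.
    mask[q][k] is True when query position q may attend to key position k.
    """
    L = len(doc_ids)

    # Block index of each token, restarting at every document boundary.
    block = []
    doc_start = 0
    for i, d in enumerate(doc_ids):
        if i > 0 and d != doc_ids[i - 1]:
            doc_start = i
        block.append((i - doc_start) // block_size)

    mask = [[False] * (2 * L) for _ in range(2 * L)]
    for q in range(L):
        for k in range(L):
            if doc_ids[q] != doc_ids[k]:
                continue  # document-level restriction (assumed)
            # M_BD: noisy query, noisy key, same block only.
            mask[q][k] = block[q] == block[k]
            # M_OBC: noisy query, clean key from a strictly earlier block.
            mask[q][L + k] = block[k] < block[q]
            # M_BC: clean query, clean key from the same or an earlier block.
            mask[L + q][L + k] = block[k] <= block[q]
            # Clean queries never attend to noisy keys: that quadrant stays False.
    return mask


if __name__ == "__main__":
    # Two packed documents (4 tokens and 2 tokens), block size 2.
    m = block_diffusion_mask([0, 0, 0, 0, 1, 1], block_size=2)
    for row in m:
        print("".join("x" if v else "." for v in row))
```

Building the mask with explicit loops keeps the three components easy to inspect; a training implementation would construct the same pattern with vectorized tensor operations.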