UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models

Guangxin He, Shen Nie, Fengqi Zhu, Yuankang Zhao, Tianyi Bai, Ran Yan, Jie Fu, Chongxuan Li, Binhang Yuan

TL;DR

The paper scales diffusion LLMs to very long contexts by introducing a diffusion-aware NTK extrapolation for Rotary Positional Embeddings (RoPE) and conducting systematic, lightweight post-training with long-sequence masking strategies. It proposes UltraLLaDA, a diffusion LLM with a 128K-token context window, which outperforms training-free baselines on long-context tasks and maintains low perplexity across extreme lengths. Key contributions include the diffusion-aware NTK formulation, an analysis of masking strategies to mitigate cross-document interference, and extensive empirical validation on the NIAH, LongBench, and RULER benchmarks. The work provides practical guidance for achieving long-context capabilities in diffusion LLMs through efficient post-training, enabling robust performance on very long documents and complex retrieval tasks.

Abstract

Diffusion LLMs have attracted growing interest, with plenty of recent work emphasizing their great potential in various downstream tasks; yet the long-context behavior of diffusion LLMs remains largely uncharted. We present a case study of post-training techniques for extending the context window of diffusion LLMs (i.e., LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post-training and analyze their impact on optimization stability and long-range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K-token context window that, in our empirical evaluation on long-context tasks, significantly outperforms training-free baselines. Our experimental results highlight the special positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K-scale context via efficient post-training.
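To make the "standard RoPE extension" concrete, the sketch below shows the widely used NTK-aware base adjustment that such extensions start from. It is not the paper's diffusion-aware formulation, which this summary does not spell out. The function names, the head dimension, the base of 10000, and the scale factor of 32 are all illustrative assumptions.

```python
def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/d), i = 0..d/2-1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """Standard NTK-aware adjustment: base' = base * scale^(d / (d - 2)).

    This is the generic rule, not the diffusion-aware variant from the paper.
    """
    return base * scale ** (head_dim / (head_dim - 2))


# Illustrative (assumed) values, not taken from the paper.
head_dim = 128
base = 10000.0
scale = 32.0  # target context length / original context length

orig_freq = rope_inv_freq(head_dim, base)
scaled_freq = rope_inv_freq(head_dim, ntk_scaled_base(base, scale, head_dim))

# The highest frequency is left untouched, while the lowest frequency is
# divided by `scale`, so the slowest-rotating dimension covers a context
# `scale` times longer before wrapping around.
highest_ratio = scaled_freq[0] / orig_freq[0]
lowest_ratio = scaled_freq[-1] / orig_freq[-1]
```

The design point is that the exponent d / (d - 2) interpolates the low-frequency dimensions fully while leaving the high-frequency dimensions, which encode local order, almost unchanged.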

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: NIAH evaluation up to a 128K context length. UltraLLaDA finds all of the needles within a context window 8--32$\times$ longer than LongLLaDA can handle.
  • Figure 2: RoPE critical dimension and training-free case study under different NTK scaling.
  • Figure 3: Different attention mechanisms for long-context training.
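
As a rough illustration of how a masking strategy can limit the cross-document interference mentioned in the TL;DR, the sketch below builds a bidirectional, document-level attention mask for several documents packed into one training sequence. This is one plausible strategy, assumed here for illustration; the summary does not state which variants the paper compares, and `document_mask` is a hypothetical helper name.

```python
def document_mask(doc_lengths: list[int]) -> list[list[bool]]:
    """Return a square mask where mask[q][k] is True iff tokens q and k
    belong to the same packed document.

    The mask is bidirectional (no causal triangle), matching the full
    attention that diffusion LLMs use within a sequence.
    """
    # Assign every token position the index of the document it came from.
    doc_id: list[int] = []
    for idx, length in enumerate(doc_lengths):
        doc_id.extend([idx] * length)

    total = len(doc_id)
    return [[doc_id[q] == doc_id[k] for k in range(total)] for q in range(total)]


# Two documents of lengths 2 and 3 packed into one sequence of 5 tokens.
mask = document_mask([2, 3])
```

The result is block-diagonal: tokens attend freely within their own document and never across a document boundary, at the cost of never seeing attention spans longer than a single document.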