Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun

TL;DR

The paper addresses the challenge of deploying diffusion-based LLMs (dLLMs) on resource-constrained devices by systematically evaluating post-training quantization (PTQ) for these models. It identifies activation outliers as a core difficulty for low-bit weight and activation quantization, and it benchmarks multiple PTQ methods across bit-widths, task types, and model variants (LLaDA-8B and Dream-7B). Three key findings emerge: 4-bit weight-only quantization with GPTQ is generally safe; 8-bit weight-activation quantization is near-lossless, with rotation-based methods (DuQuant, QuaRot) outperforming simpler approaches such as SmoothQuant; and instruct-tuned models are more robust to quantization than base models. The work provides practical guidance for efficient dLLM deployment and establishes a foundation for future improvements in PTQ for diffusion-based language models. Code is available at the authors' repository.

Abstract

Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. Our code is publicly available at https://github.com/FelixMessi/QDLM.
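The abstract's point that outliers "dominate the dynamic range" can be illustrated with a small sketch. It assumes symmetric per-tensor uniform quantization, synthetic Gaussian activations, and a single injected outlier of 100.0; none of these choices come from the paper, and the helper names (quantize_dequantize, mean_abs_error) are hypothetical. The sketch only shows why one large value inflates the quantization step and degrades precision for all other values.

```python
import random

def quantize_dequantize(values, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Illustrative assumption: the scale is set by the largest absolute value,
    so a single outlier stretches the step size for every entry.
    """
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)                    # map to the integer grid
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        out.append(q * scale)                   # map back to floating point
    return out

def mean_abs_error(values, bits, count):
    """Mean absolute round-trip error over the first `count` entries."""
    deq = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values[:count], deq[:count])) / count

rng = random.Random(0)
typical = [rng.gauss(0.0, 1.0) for _ in range(1024)]   # synthetic "normal" activations
with_outlier = typical + [100.0]                       # hypothetical massive outlier

# Error is measured on the same 1024 typical values in both cases.
err_clean = mean_abs_error(typical, 4, len(typical))
err_outlier = mean_abs_error(with_outlier, 4, len(typical))
print(f"4-bit error without outlier: {err_clean:.4f}")
print(f"4-bit error with outlier:    {err_outlier:.4f}")
```

With the outlier present, the step size grows to roughly 100/7, so most typical values round to zero. This is the precision loss for "the majority of values" that the abstract describes.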

Paper Structure

This paper contains 31 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Visualizations of activation outliers in LLaDA-8B-Base (1) and LLaDA-8B-Instruct (2). Outliers are observed at the inputs of various linear layers and can be classified as Normal Outliers (a(1)–c(1)/a(2)–c(2)), with relatively large magnitudes across tokens, and Massive Outliers (d(1), d(2)), with extremely large values on a few tokens. Notably, these massive outliers are identified at the second linear layer of the feed-forward network (FFN) module.
  • Figure 2: Visualizations of activation outliers in Dream-7B-Base. We observe relatively large normal outliers in the input to the FFN up-projection layer (c), while the massive outliers (d) exhibit smaller peak values compared to those in LLaDA models (Figure 1).
  • Figure B1: More visualizations of activation outliers in LLaDA-8B-Base.
  • Figure B2: More visualizations of activation outliers in LLaDA-8B-Instruct.