Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models

Tianao Zhang, Zhiteng Li, Xianglong Yan, Haotong Qin, Yong Guo, Yulun Zhang

TL;DR

Quant-dLLM tackles the challenge of ultra-low-bit (2-bit) post-training quantization for diffusion large language models (dLLMs), where timestep-dependent masking shifts activation statistics and denoising errors accumulate. It introduces three core components—Masked Calibration Simulation (MCS) to align calibration with diffusion masking, Data-aware Any-order Quantizer (DAQ) to express weights as a sum of binarized components with row-column scaling, and Adaptive Blockwise Mixed Precision (ABMP) to allocate bits across blocks under a fixed 2-bit budget. The results show state-of-the-art 2-bit weight-only accuracy across multiple dLLMs and tasks, significantly outperforming AR-transfer PTQ baselines and recovering a large portion of full-precision performance. This framework enables efficient, training-free deployment of dLLMs on resource-constrained hardware, with code to be released publicly.
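The idea of writing a weight matrix as a sum of binarized components with row-column scaling can be illustrated with a small sketch. This is not the authors' DAQ: it is a data-free, greedy residual fit that ignores calibration data, and every name in it (`binarize_rc`, `quantize_any_order`, the alternating least-squares update) is an assumption made for illustration. Each component has the form r_i * B_ij * c_j with B_ij in {-1, +1}, and "order" means the number of such components.

```python
import math
import random


def binarize_rc(R, iters=5):
    """Fit R[i][j] ~ r[i] * B[i][j] * c[j] with B = sign(R).

    Because B*B = 1 elementwise, the squared error equals that of the
    rank-1 fit |R[i][j]| ~ r[i] * c[j], solved by alternating least squares.
    """
    n, m = len(R), len(R[0])
    B = [[1.0 if v >= 0 else -1.0 for v in row] for row in R]
    A = [[abs(v) for v in row] for row in R]
    r, c = [0.0] * n, [1.0] * m
    for _ in range(iters):
        cc = sum(x * x for x in c)
        if cc == 0.0:
            break
        r = [sum(A[i][j] * c[j] for j in range(m)) / cc for i in range(n)]
        rr = sum(x * x for x in r)
        if rr == 0.0:
            break
        c = [sum(A[i][j] * r[i] for i in range(n)) / rr for j in range(m)]
    return r, B, c


def quantize_any_order(W, order=2, iters=5):
    """Greedily fit `order` binarized components, each to the current residual."""
    n, m = len(W), len(W[0])
    residual = [row[:] for row in W]
    components = []
    for _ in range(order):
        r, B, c = binarize_rc(residual, iters)
        components.append((r, B, c))
        residual = [[residual[i][j] - r[i] * B[i][j] * c[j] for j in range(m)]
                    for i in range(n)]
    return components


def reconstruct(components, n, m):
    """Sum the components back into a dense matrix."""
    W_hat = [[0.0] * m for _ in range(n)]
    for r, B, c in components:
        for i in range(n):
            for j in range(m):
                W_hat[i][j] += r[i] * B[i][j] * c[j]
    return W_hat


def frobenius_error(W, W_hat):
    return math.sqrt(sum((w - h) ** 2
                         for rw, rh in zip(W, W_hat)
                         for w, h in zip(rw, rh)))


rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
for order in (1, 2, 3):
    comps = quantize_any_order(W, order)
    print(order, round(frobenius_error(W, reconstruct(comps, 8, 16)), 3))
```

In this toy setting the reconstruction error shrinks as the order grows, at a cost of roughly one bit per weight per component plus the row and column scales.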

Abstract

Diffusion large language models (dLLMs), which offer bidirectional context and flexible masked-denoising generation, are emerging as a compelling alternative to autoregressive (AR) LLMs. However, like AR LLMs, their model sizes continue to grow, motivating weight compression for deployment. Although post-training quantization (PTQ) is effective for AR LLMs, directly transferring it to dLLMs at 2-bit leads to unsatisfactory performance. To close this gap, we propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Since masked-denoising activations in dLLMs differ from the fully visible signals assumed by standard PTQ methods, we introduce Masked Calibration Simulation (MCS) to align calibration with the timestep-dependent masking, which yields more reliable calibration. Moreover, we propose a Data-aware Any-order Quantizer (DAQ) that learns ultra-low-bit weight representations via an optimization algorithm. It performs iterative approximation guided by our simulated calibration data. In addition, under a strict 2-bit budget, we introduce Adaptive Blockwise Mixed Precision (ABMP), a sensitivity-based precision allocation scheme that adaptively assigns bit width across channel groups. When restricted to 2-bit precision, Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) AR-transfer PTQ methods on dLLMs. The code and models will be available at: https://github.com/ZTA2785/Quant-dLLM.
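To make the calibration mismatch concrete, the following minimal sketch builds masked calibration inputs instead of fully visible ones. It assumes a common masked-diffusion forward process: draw a timestep t uniformly from [0, 1) per sequence, then replace each token with a mask token independently with probability t. The abstract does not specify MCS at this level of detail, so the sampling scheme, the function name `simulate_masked_calibration`, and the placeholder `MASK_ID` are all hypothetical.

```python
import random

MASK_ID = -1  # hypothetical placeholder; a real model defines its own mask token id


def simulate_masked_calibration(sequences, mask_id=MASK_ID, seed=0):
    """Return (t, masked_tokens) pairs, one per input token sequence.

    Assumption: t ~ Uniform[0, 1) per sequence, and each token is masked
    independently with probability t.
    """
    rng = random.Random(seed)
    samples = []
    for tokens in sequences:
        t = rng.random()
        masked = [mask_id if rng.random() < t else tok for tok in tokens]
        samples.append((t, masked))
    return samples


calibration_text = [[101, 7, 42, 9, 13, 55], [101, 3, 3, 8]]
for t, masked in simulate_masked_calibration(calibration_text):
    print(f"t={t:.2f}", masked)
```

Feeding such partially masked sequences through the model during calibration exposes the quantizer to activations closer to those seen during denoising than fully visible text would.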

Paper Structure

This paper contains 16 sections, 12 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: dLLMs' performance on 7 general tasks. Our Quant-dLLM yields the best accuracy at equal memory cost.
  • Figure 2: Overview of our Quant-dLLM. Masked Calibration Simulation: Aligns calibration with diffusion by simulating masked, timestep-aware inputs. Adaptive Blockwise Mixed Precision: Assigns binary orders by importance under a 2-bit average. Data-aware Any-order Quantizer: Builds multi-binary RC forms with data-aware optimization.
  • Figure 3: Average accuracy of mathematical & scientific reasoning, and code generation datasets on LLaDA series and Dream series.
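The allocation step named in Figure 2, assigning binary orders by importance under a 2-bit average, can be sketched as a simple budget-preserving swap. The sketch assumes equally sized blocks and precomputed sensitivity scores, and it moves one order from the least sensitive blocks to the most sensitive ones. The function `allocate_orders`, the `swap_fraction` parameter, and the scores below are hypothetical and do not come from the paper.

```python
def allocate_orders(sensitivity, swap_fraction=0.25, base=2):
    """Assign a binary order (1, 2, or 3 here) to each block.

    Every block starts at `base`. The k least sensitive blocks give up one
    order and the k most sensitive blocks gain one, so the average over
    equally sized blocks stays exactly at `base`.
    """
    n = len(sensitivity)
    k = min(int(n * swap_fraction), n // 2)  # keep the two groups disjoint
    ranked = sorted(range(n), key=lambda i: sensitivity[i])  # ascending
    orders = [base] * n
    for i in ranked[:k]:
        orders[i] -= 1
    for i in ranked[n - k:]:
        orders[i] += 1
    return orders


scores = [0.9, 0.1, 0.5, 0.4, 0.05, 0.7, 0.3, 0.6]  # made-up sensitivities
orders = allocate_orders(scores)
print(orders, sum(orders) / len(orders))
```

Pairing each promotion with a demotion is one simple way to respect a fixed budget; other allocation rules would satisfy the same constraint.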