CDLM: Consistency Diffusion Language Models For Faster Sampling

Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami

TL;DR

CDLM is a training-based acceleration method for diffusion language models that combines consistency modeling with block-wise causal fine-tuning. It distills guidance from a bidirectional teacher into a block-wise causal student and enforces within-block consistency across decoding steps, which lets the student finalize multiple tokens per step. Trained on teacher trajectories collected offline with a combination of distillation, consistency, and DLM losses, CDLM reduces latency by up to ~14.5x and the number of decoding steps by up to ~7.9x while maintaining competitive accuracy on math and coding benchmarks. The approach also enables cache-friendly inference through block-wise KV caching and confidence-thresholded decoding, offering practical improvements for open-source DLMs and suggesting directions for scaling with larger teachers and datasets.

Abstract

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
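To make "multi-token finalization" with confidence-thresholded decoding concrete, here is a minimal sketch of a single refinement step inside one block. It is an illustration under stated assumptions, not the authors' implementation: the `MASK` sentinel, the `finalize_step` helper, the `(token_id, confidence)` prediction format, and the fallback that finalizes the single most confident position when nothing clears the threshold are all hypothetical choices made for this example.

```python
MASK = None  # hypothetical sentinel for a position that is still masked


def finalize_step(block, predictions, threshold):
    """One refinement step over a single block.

    block:       list of token ids, with MASK at undecoded positions
    predictions: list of (token_id, confidence) pairs, one per position
    threshold:   minimum confidence required to finalize a position
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block)

    # Finalize every masked position whose confidence clears the threshold.
    chosen = [i for i in masked if predictions[i][1] >= threshold]
    if not chosen:
        # Assumed fallback: finalize the most confident position so that
        # every step makes progress.
        chosen = [max(masked, key=lambda i: predictions[i][1])]

    out = list(block)
    for i in chosen:
        out[i] = predictions[i][0]
    return out


block = [MASK, MASK, MASK, MASK]
preds = [(11, 0.97), (12, 0.40), (13, 0.93), (14, 0.55)]
step1 = finalize_step(block, preds, threshold=0.9)
print(step1)  # [11, None, 13, None]
```

In this toy step, two of the four positions are finalized at once. Finalizing several tokens in one step is how the number of refinement steps can fall below the number of tokens in a block.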

Paper Structure

This paper contains 63 sections, 7 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: Left: Block-wise decoding trajectory of the teacher (steps $0\!\to\!N$; diffusion time $t:1\!\to\!0$). Right: The student's three-objective loss at an intermediate state $y$: (i) distillation from teacher logits on newly unmasked positions, (ii) consistency between $y$ and its block-completion $y^{\star}$, and (iii) masked-denoising (DLM) loss on randomly masked ground-truth text.
  • Figure 2: Left: Teacher DLM with full bidirectional attention, attending to the entire context. Right: Student DLM with a block-wise causal mask, attending to the prompt, previously completed blocks, and the current decoding block.
  • Figure 3: Throughput vs. AR (Dream). Tokens-per-second on GSM8K-CoT and MBPP-Instruct for Dream-7B-Instruct (naive), Qwen2.5-7B-Instruct (AR), and CDLM--Dream.
  • Figure 4: Throughput vs. AR (LLaDA). Tokens-per-second on GSM8K and HumanEval for LLaDA-8B-Instruct (naive), Llama3.1-8B-Instruct (AR), and CDLM--LLaDA.
  • Figure 5: Teacher outputs vs. temperature. Final outputs from LLaDA-8B-Instruct at sampling temperatures $\tau \in \{0.0,\,0.5,\,1.0\}$. Answers are marked with $\boxed{\cdot}$ (blue = correct; red = incorrect).
  • ...and 1 more figure
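The block-wise causal mask described in the Figure 2 caption can be sketched in a few lines. This is a hedged illustration rather than the paper's code: the function name `block_causal_mask`, its parameters, and the choice to treat the prompt as a single block that attends only to itself are assumptions made for the example. The caption itself only states that the student attends to the prompt, previously completed blocks, and the current decoding block.

```python
def block_causal_mask(prompt_len, gen_len, block_size):
    """Return mask[q][k] == True when query position q may attend to key position k."""
    total = prompt_len + gen_len

    def block_id(pos):
        # Assumption: the prompt is block 0; generated blocks are numbered from 1.
        if pos < prompt_len:
            return 0
        return 1 + (pos - prompt_len) // block_size

    # A position sees its own block (bidirectionally) and all earlier blocks.
    return [
        [block_id(k) <= block_id(q) for k in range(total)]
        for q in range(total)
    ]


mask = block_causal_mask(prompt_len=2, gen_len=4, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Because no position attends to a later block, the keys and values of the prompt and of completed blocks do not change once those blocks are finalized, which is what makes block-wise KV caching possible.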