Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu

TL;DR

This work introduces CadLLM, a training-free, confidence-aware calibration framework that adaptively tunes diffusion-based LLM decoding. By analyzing per-block and per-step confidence, CadLLM dynamically adjusts block size, refinement steps, vocabulary subset, and unmasking threshold, while also using a repetition detector to preserve diversity. The method is KV-cache compatible and demonstrates up to 2.28x throughput gains with competitive accuracy against state-of-the-art baselines across multiple benchmarks and generation lengths. These results highlight the practical potential of training-free, adaptive control to accelerate diffusion-based language generation in real-world deployments.
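The four knobs named above can be pictured as the output of a small controller that reads one confidence value. The following is a minimal sketch, not the paper's update rule: the names `DecodeParams` and `calibrate`, the linear interpolation, and the direction of each adjustment (higher confidence gives a larger block, fewer steps, a smaller vocabulary subset, and a lower threshold) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DecodeParams:
    """Hypothetical bundle of the decoding knobs the summary mentions."""
    block_size: int    # tokens decoded per block
    steps: int         # refinement steps per block
    vocab_k: int       # size of the vocabulary subset used for sampling
    threshold: float   # confidence needed to unmask a token


def _mix(lo: float, hi: float, t: float) -> float:
    # Convex combination; returns lo exactly at t=0 and hi exactly at t=1.
    return lo * (1.0 - t) + hi * t


def calibrate(avg_conf: float, cautious: DecodeParams,
              aggressive: DecodeParams) -> DecodeParams:
    """Interpolate between two presets using average confidence in [0, 1].

    Assumption: low confidence should select the cautious preset and high
    confidence the aggressive one. The source does not specify this mapping.
    """
    t = min(1.0, max(0.0, avg_conf))  # clamp out-of-range confidence
    return DecodeParams(
        block_size=round(_mix(cautious.block_size, aggressive.block_size, t)),
        steps=round(_mix(cautious.steps, aggressive.steps, t)),
        vocab_k=round(_mix(cautious.vocab_k, aggressive.vocab_k, t)),
        threshold=_mix(cautious.threshold, aggressive.threshold, t),
    )


# Illustrative presets; the numbers are placeholders, not values from the paper.
CAUTIOUS = DecodeParams(block_size=16, steps=32, vocab_k=50000, threshold=0.9)
AGGRESSIVE = DecodeParams(block_size=64, steps=8, vocab_k=10000, threshold=0.7)

params = calibrate(0.5, CAUTIOUS, AGGRESSIVE)
```

In a decoding loop, such a function would be called once per block or per step with the running average confidence of newly unmasked tokens, so the static settings become per-step values.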

Abstract

We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
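To make the softmax saving concrete, the sketch below normalizes over a subset of the vocabulary instead of all of it. It is an illustration under stated assumptions: the abstract does not say how the subset is chosen, so selecting the `k` largest logits is an assumption, and `subset_softmax` is a hypothetical name rather than a function from the paper's code.

```python
import heapq
import math


def subset_softmax(logits: list[float], k: int) -> dict[int, float]:
    """Softmax restricted to the k highest-logit token ids.

    Returns a mapping from token id to probability. Token ids outside the
    subset are omitted, which is equivalent to giving them probability zero.
    """
    k = max(1, min(k, len(logits)))  # keep k within [1, vocabulary size]
    # nlargest returns ids sorted by logit, largest first.
    top_ids = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    max_logit = logits[top_ids[0]]   # subtract the max for numerical stability
    exps = [math.exp(logits[i] - max_logit) for i in top_ids]
    total = sum(exps)
    return {i: e / total for i, e in zip(top_ids, exps)}


# Toy vocabulary of four tokens; only the two strongest candidates are kept.
probs = subset_softmax([2.0, 0.5, 1.0, -1.0], k=2)
```

Only `k` exponentials are computed per position, so the cost of normalization scales with the subset size rather than the full vocabulary. Varying `k` with confidence is what would let a controller trade sampling breadth for speed.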

Paper Structure

This paper contains 27 sections, 6 equations, 2 figures, 11 tables, 1 algorithm.

Figures (2)

  • Figure 1: (a) Confidence dynamics for three different datasets. (b) Latency vs. vocabulary size.
  • Figure 2: Overview of CadLLM’s adaptive controller. The controller replaces static decoding parameters with values updated dynamically from lightweight confidence and progress signals.