SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Authors
Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Ferdinando Fioretto
Abstract
Speculative decoding has become the standard approach for accelerating Large
Language Model (LLM) inference. It exploits a lossless draft-then-verify
procedure to circumvent the latency of autoregressive decoding, achieving
impressive speed-ups. Yet, current speculative decoding approaches remain
limited by two fundamental bottlenecks: (1) the autoregressive dependency
during drafting which limits parallelism, and (2) frequent rejections of draft
tokens caused by misalignment between the draft and verify models. This paper
proposes SpecDiff-2, a novel framework to jointly address these two
bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to
address bottleneck (1) and develops novel techniques to calibrate discrete
diffusion drafters with autoregressive verifiers, addressing bottleneck (2).
Experimental results across a comprehensive benchmark suite show that
SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and
mathematical benchmarks, improving tokens-per-second by up to an average of
+55% over previous baselines and obtaining up to 5.5x average speed-up over
standard decoding, without any loss of accuracy.