Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev
TL;DR
The paper tackles the inefficiency of autoregressive LLM inference with an adaptive self-speculative decoding method that requires no fine-tuning. It introduces the Adaptive Draft Model Generator (ADMG), which selects a draft subnetwork on the fly by removing attention layers according to cosine similarity thresholds and two structural rules. The approach is evaluated against Draft & Verify on Llama-2-13B, Llama-2-13B-Chat, and CodeLlama-13B using CNN/DM, XSUM, and HumanEval, showing competitive speedups, particularly on CodeLlama-13B. The work highlights robustness to calibration-data mismatch and presents a simple, plug-and-play alternative to black-box optimization in self-speculative decoding.
Abstract
We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our lightweight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.
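
To make the idea of a cosine-similarity rule concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that each attention layer is scored by the cosine similarity between the hidden state entering it and the hidden state leaving it, and that layers scoring at or above a threshold become candidates for removal from the draft model. The function names, the single shared threshold, and the toy vectors are all hypothetical; the two structural rules mentioned in the TL;DR are not specified in this section and are omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as maximally dissimilar to avoid dividing by zero.
    return dot / norm if norm > 0 else 0.0


def select_layers_to_skip(layer_inputs, layer_outputs, threshold):
    """Return indices of layers whose output direction stays close to its input.

    Assumption for this sketch: layer_inputs[i] and layer_outputs[i] are the
    hidden-state vectors before and after attention layer i for the current
    context. A high similarity suggests the layer changes the state little,
    so it is a candidate for removal from the draft model.
    """
    skip = []
    for i, (x, y) in enumerate(zip(layer_inputs, layer_outputs)):
        if cosine_similarity(x, y) >= threshold:
            skip.append(i)
    return skip


# Toy example with three hypothetical layers and 2-D hidden states.
toy_inputs = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
toy_outputs = [[1.0, 0.01], [0.0, 1.0], [2.0, 2.0]]
print(select_layers_to_skip(toy_inputs, toy_outputs, threshold=0.95))  # [0, 2]
```

In the toy example, layers 0 and 2 leave the direction of the hidden state nearly unchanged and are flagged, while layer 1 rotates it by 90 degrees and is kept. Because the scores depend on the current hidden states, the selected set can differ from one input context to the next, which is the sense in which the draft model is adaptive.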
