Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity

Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev

TL;DR

The paper tackles the inefficiency of autoregressive LLM inference with an adaptive form of self-speculative decoding that requires no fine-tuning. It introduces the Adaptive Draft Model Generator (ADMG), which selects a draft subnetwork on the fly by removing attention layers according to cosine similarity thresholds and two structural rules. ADMG is evaluated against Draft & Verify on Llama-2-13B, Llama-2-13B-Chat, and CodeLlama-13B using CNN/DM, XSUM, and HumanEval, and shows competitive speedups, particularly on CodeLlama-13B. The work highlights robustness to calibration-data mismatch and presents a simple, plug-and-play alternative to black-box optimization in self-speculative decoding.
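
The threshold-based layer selection summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the similarity is measured between the hidden state entering and leaving each attention layer, and that a layer whose similarity meets the threshold is a candidate for removal from the draft model. The function names, the toy vectors, and the threshold value are hypothetical, and the two structural rules are not reproduced because this summary does not specify them.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_skippable_layers(
    layer_inputs: Sequence[Sequence[float]],
    layer_outputs: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return indices of layers whose output stays nearly parallel to its input.

    Assumption (not from the source): a high input/output similarity means the
    layer changes the hidden state little, so the draft model may skip it.
    """
    skipped = []
    for idx, (x, y) in enumerate(zip(layer_inputs, layer_outputs)):
        if cosine_similarity(x, y) >= threshold:
            skipped.append(idx)
    return skipped


# Toy hidden states for three hypothetical attention layers.
layer_inputs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
layer_outputs = [[0.0, 1.0], [1.1, 0.9], [0.0, 2.5]]

# Layer 0 rotates its input by 90 degrees and is kept; layers 1 and 2 barely
# change direction and are marked as skippable.
skipped = select_skippable_layers(layer_inputs, layer_outputs, threshold=0.95)
print(skipped)
```

In an actual model the vectors would be per-token hidden states rather than hand-written lists, and the set of skipped layers would be recomputed as the input context changes, which is what makes the draft model adaptive.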

Abstract

We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our lightweight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.

Paper Structure (8 sections, 5 tables, 1 algorithm)