OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs
Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari
TL;DR
This work shows that existing speculative decoding methods fail to generalize to long-context inputs, which limits their practical speedups. It introduces LongSpecBench, a benchmark for long-context performance, and OWL, a length-generalized speculative decoding framework with three components: an LSTM-based drafter, a [SPEC] verifier token that yields richer representations, and a hybrid decoder (HOWL) that combines tree and non-tree strategies. OWL achieves an acceptance length of roughly 4.0–4.3 on long-context inputs and HOWL reaches about 6.1, substantially outperforming EAGLE3 while delivering notable speedups. The results suggest that length generalization, richer intermediate signals, and hybrid decoding are key to scalable acceleration of LLMs on real-world long-context workloads. The authors also release code and datasets to support further research.
Abstract
Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find that current approaches degrade severely on long contexts; for instance, EAGLE3 actually slows generation down, yielding a speedup of only 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, so that it generalizes across context lengths, (2) a special token [SPEC] in the verifier that produces richer representations for the drafter, and (3) a hybrid algorithm that combines tree and non-tree decoding methods. We release all code and datasets to advance future research.
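
To make the acceptance-length metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes greedy, chain-style (non-tree) verification and counts the verifier's correction or bonus token toward the length; both conventions are assumptions here. The names `verify_chain`, `mean_acceptance_length`, and `toy_verifier` are hypothetical, and the verifier is simulated token by token, whereas a real system would score the whole draft in one forward pass.

```python
from typing import Callable, List, Sequence


def verify_chain(
    prefix: Sequence[int],
    draft: Sequence[int],
    verifier_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy, assumed)."""
    context = list(prefix)
    emitted: List[int] = []
    for token in draft:
        target = verifier_next(context)
        if token != target:
            # First mismatch: keep the verifier's token and discard the rest of the draft.
            emitted.append(target)
            return emitted
        emitted.append(token)
        context.append(token)
    # Every drafted token was accepted, so the verifier contributes one bonus token.
    emitted.append(verifier_next(context))
    return emitted


def mean_acceptance_length(steps: Sequence[Sequence[int]]) -> float:
    """Average number of tokens emitted per verification step."""
    return sum(len(step) for step in steps) / len(steps)


def toy_verifier(context: List[int]) -> int:
    """Hypothetical stand-in for the target model: predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


prefix = [1, 2, 3]
steps = [
    verify_chain(prefix, [4, 5, 9, 0], toy_verifier),  # mismatch at the third drafted token
    verify_chain(prefix, [4, 5, 6, 7], toy_verifier),  # fully accepted, plus a bonus token
    verify_chain(prefix, [0], toy_verifier),           # rejected immediately
]
print([len(step) for step in steps], mean_acceptance_length(steps))
```

Under these assumptions, a drafter whose guesses are rejected early emits about one token per step, so the drafting overhead is not amortized; this is one way a speculative method can end up slower than ordinary decoding.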
