OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs

Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari

TL;DR

This work shows that existing speculative decoding methods fail to generalize to long-context inputs, which limits their practical speedups. It introduces LongSpecBench, a benchmark for long-context performance, and OWL, a length-generalized speculative decoding framework with three components: an LSTM-based drafter, a [SPEC] verifier token that yields richer representations, and a hybrid decoder (HOWL) that combines tree and non-tree strategies. OWL reaches an acceptance length of roughly 4.0–4.3 on long-context inputs and HOWL about 6.1, both substantially higher than EAGLE3, and both deliver speedups. The results suggest that length generalization, richer intermediate signals, and hybrid decoding are key to accelerating LLMs on long-context workloads. The authors also release code and datasets to support further research.

Abstract

Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find that current approaches degrade severely with long contexts; for instance, EAGLE3 even slows generation down, with a speedup of only 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, which lets it generalize across context lengths, (2) a special token [SPEC] in the verifier that produces a richer representation for the drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.
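Acceptance length is the metric the abstract uses to compare OWL with EAGLE3, so a small illustration of how it is commonly computed may help. The sketch below is not the authors' code. It assumes greedy verification and assumes the convention that each verification step counts the matched draft tokens plus one token supplied by the verifier itself. The function names `accepted_length` and `mean_acceptance_length` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def accepted_length(draft: Sequence[int], verifier: Sequence[int]) -> int:
    """Tokens emitted by one draft-then-verify step (assumed greedy matching).

    draft:    k tokens proposed by the drafter.
    verifier: k + 1 tokens the verifier would pick at the same positions;
              the extra token is what the verifier emits after the last
              draft position.
    """
    if len(verifier) != len(draft) + 1:
        raise ValueError("verifier must provide one more token than the draft")
    matched = 0
    for d, v in zip(draft, verifier):
        if d != v:
            break  # first mismatch: the rest of the draft is discarded
        matched += 1
    # Assumed convention: the verifier's own token at the stopping position
    # is always kept, so every step emits at least one token.
    return matched + 1


def mean_acceptance_length(
    steps: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Average accepted length over a list of (draft, verifier) steps."""
    lengths = [accepted_length(d, v) for d, v in steps]
    if not lengths:
        raise ValueError("need at least one step")
    return sum(lengths) / len(lengths)
```

Under this convention a drafter that never matches the verifier still scores 1.0, so an acceptance length close to 1 means the draft tokens contribute almost nothing while their cost is still paid. That is one way a speculative method can end up slower than plain decoding.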

Paper Structure

This paper contains 26 sections, 4 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Llama-3.1-8B-Instruct
  • Figure 2: Llama-3.3-70B-Instruct
  • Figure 4: EAGLE3 (left) and OWL (right).
  • Figure 5: SpecBench (Xia et al., 2024)
  • Figure 6: LongSpecBench (ours)
  • ...and 8 more figures