SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao

TL;DR

SuffixDecoding introduces a model-free speculative decoding framework tailored for agentic AI workloads with long, repetitive token sequences. By maintaining global and per-request suffix trees and greedily constructing a limited speculation tree, it adaptively extends or contracts token speculation based on pattern matches, enabling fast verification and high acceptance rates. The method demonstrates up to 5.3× speedups on AgenticSQL and strong gains on SWE-Bench, while also offering a hybrid path that pairs it with model-based approaches for mixed workloads. The work includes extensive evaluations, ablations, and real-system integration (vLLM/OpenHands), and provides an open-source release for practical deployment and further research.

Abstract

Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce SuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3×, outperforming state-of-the-art methods: 2.8× faster than model-based approaches like EAGLE-2/3 and 1.9× faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced at https://github.com/snowflakedb/ArcticInference
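
To make the idea of caching earlier token sequences and speculating adaptively more concrete, here is a minimal sketch. It is not the authors' implementation. It swaps the suffix tree for a plain dictionary of n-grams and proposes a single token chain rather than a speculation tree. The names `SuffixCache`, `max_depth`, `longest_match`, and `speculate` are hypothetical, and the budget `int(alpha * p)` is only an assumption modeled on the MAX_SPEC = αp rule mentioned in the Figure 2 caption.

```python
from collections import Counter, defaultdict


class SuffixCache:
    """Toy stand-in for a suffix tree (an assumption, not the paper's data structure).

    Maps every n-gram of length 1..max_depth seen in cached sequences to a
    Counter of the tokens that followed it.
    """

    def __init__(self, max_depth=8):
        self.max_depth = max_depth
        self.next_counts = defaultdict(Counter)

    def add(self, tokens):
        """Index one prompt or previous output."""
        for i in range(len(tokens)):
            for n in range(1, self.max_depth + 1):
                if i + n >= len(tokens):
                    break  # no following token to record
                self.next_counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    def longest_match(self, context):
        """Length p of the longest suffix of `context` present in the cache."""
        for n in range(min(self.max_depth, len(context)), 0, -1):
            if tuple(context[-n:]) in self.next_counts:
                return n
        return 0

    def speculate(self, context, alpha=2.0):
        """Greedily draft up to int(alpha * p) tokens, most frequent continuation first."""
        budget = int(alpha * self.longest_match(context))
        ctx, draft = list(context), []
        for _ in range(budget):
            n = self.longest_match(ctx)
            if n == 0:
                break  # pattern ran out; stop speculating early
            token = self.next_counts[tuple(ctx[-n:])].most_common(1)[0][0]
            draft.append(token)
            ctx.append(token)
        return draft


cache = SuffixCache(max_depth=4)
cache.add([1, 2, 3, 4, 5, 6])       # a previously generated sequence
draft = cache.speculate([9, 2, 3])  # the suffix (2, 3) matches, so p = 2
```

In this toy version a longer match yields a larger draft budget, while a context with no match yields an empty draft, so no verification work is spent on it.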

Paper Structure

This paper contains 39 sections, 5 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of SuffixDecoding's algorithm. Two suffix trees track ongoing inference (top-left) and previous outputs (bottom-left). SuffixDecoding uses these trees to find matching patterns based on recently generated tokens. It constructs a speculation tree (middle) by selecting the most likely continuations, scoring them based on frequency statistics. Finally, the best candidate is verified by the LLM in a single forward pass (right), with accepted tokens (shown in green) being added to the output and used for the next round of speculation.
  • Figure 2: (a) The mean number of accepted tokens increases with the pattern match length $p$, which motivates setting $\texttt{MAX\_SPEC} = \alpha p$. (b) This choice achieves a better trade-off between acceptance rate and speculative speedup.
  • Figure 3: AgenticSQL is a multi-agent workflow consisting of structured generation, unstructured generation, and retrieval-augmented generation steps across several different LLMs. Useful features are extracted from the user question (Classify and Extract) and supplemented with retrieved context (Enrich). Several text-to-SQL steps propose solutions to the user question (SQL 1… N) in parallel with feedback from an error corrector. A last Combine step synthesizes the proposed SQL candidates into a final SQL query and text response.
  • Figure 4: Speculative speedups (top) and mean accepted tokens per step (bottom) compared to vanilla decoding for SuffixDecoding and baseline methods on three benchmarks: Spec-Bench, AgenticSQL, and SWE-Bench. Experiments use Llama-3.1-8B-Instruct on a single H100 GPU with batch size 1. Speedup is measured as the ratio of wall-clock time-per-output-token relative to vanilla decoding. Suffix (tree) and Hybrid (tree) use SuffixDecoding's tree speculation algorithm, which constructs a speculation tree from the suffix tree for parallel verification. Suffix (linear) and Hybrid (linear) use a simpler linear speculation approach that only allows sequential token chains. The hybrid variants combine SuffixDecoding with EAGLE-3, dynamically selecting between suffix-based and model-based speculation based on pattern match confidence. Note that EAGLE-2/3 and Token Recycling failed to run on several SWE-Bench tasks due to long context lengths (>8192 tokens), indicated by missing bars. Spec-Bench represents a non-agentic workload and is included for comparison. Further sub-task breakdowns, including the raw time-per-output-token and mean acceptance lengths, can be found in the appendix.
  • Figure 5: A SuffixDecoding speculation tree containing 66 tokens for the AgenticSQL Extract task.
  • ...and 4 more figures
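
The verification step in the Figure 1 caption and the speedup metric in the Figure 4 caption can be illustrated with a small sketch. It assumes greedy verification, in which a drafted token is accepted only while it equals the token the LLM itself would produce at that position. It also assumes that speedup is the vanilla time-per-output-token divided by the method's. The function names `count_accepted` and `speedup` and the sample numbers are hypothetical and are not taken from the paper.

```python
def count_accepted(draft, target_tokens):
    """Length of the longest draft prefix that agrees with the target model.

    `target_tokens` stands in for the tokens the LLM predicts at each draft
    position during its single verification forward pass (an assumption).
    """
    accepted = 0
    for drafted, expected in zip(draft, target_tokens):
        if drafted != expected:
            break  # first disagreement ends acceptance
        accepted += 1
    return accepted


def speedup(tpot_vanilla, tpot_method):
    """Speedup as a ratio of wall-clock time-per-output-token values."""
    return tpot_vanilla / tpot_method


# Illustrative values only: two of three drafted tokens match the model.
n_accepted = count_accepted([4, 5, 6], [4, 5, 7])
# Illustrative values only: 10 ms/token vanilla vs. 2 ms/token with speculation.
ratio = speedup(10.0, 2.0)
```

Under these assumptions, a speedup above 1 means the method spends less wall-clock time per output token than vanilla decoding.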