SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao
TL;DR
SuffixDecoding is a model-free speculative decoding framework tailored for agentic AI workloads with long, repetitive token sequences. It maintains a global suffix tree and per-request suffix trees, and greedily constructs a size-limited speculation tree, extending or contracting the number of speculated tokens according to how well the recent context matches cached patterns. This enables fast verification and high acceptance rates. The method demonstrates speedups of up to 5.3× on AgenticSQL and strong gains on SWE-Bench, and it also offers a hybrid mode that pairs with model-based approaches for mixed workloads. The work includes extensive evaluations, ablations, and real-system integration (vLLM/OpenHands), and provides an open-source release for practical deployment and further research.
Abstract
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce SuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3×, outperforming state-of-the-art methods: it is 2.8× faster than model-based approaches like EAGLE-2/3 and 1.9× faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced at https://github.com/snowflakedb/ArcticInference
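The caching and adaptive speculation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: a depth-bounded trie over suffixes stands in for a real suffix tree, the draft is a single greedy path rather than a speculation tree, and the speculation budget is assumed to grow linearly with the length of the matched pattern. The names `SuffixTrie`, `max_depth`, `max_pattern`, and `alpha` are hypothetical and chosen only for illustration. Verification of the draft by the target LLM is omitted.

```python
class SuffixTrie:
    """Depth-bounded trie over all suffixes of cached token sequences.

    Illustrative stand-in for a suffix tree; not the paper's data structure.
    """

    def __init__(self, max_depth=8):
        self.max_depth = max_depth
        self.root = {"count": 0, "children": {}}

    def add(self, tokens):
        # Insert every suffix of `tokens`, truncated to max_depth tokens,
        # counting how often each path has been seen.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node["children"].setdefault(
                    tok, {"count": 0, "children": {}}
                )
                node["count"] += 1

    def _walk(self, pattern):
        # Follow `pattern` from the root; return None if it is not cached.
        node = self.root
        for tok in pattern:
            node = node["children"].get(tok)
            if node is None:
                return None
        return node

    def speculate(self, context, alpha=2, max_pattern=4):
        # Find the longest suffix of `context` that is cached and has
        # at least one known continuation.
        for p in range(min(max_pattern, len(context)), 0, -1):
            node = self._walk(context[-p:])
            if node is not None and node["children"]:
                break
        else:
            return []  # no match: speculate nothing

        # Assumed rule: a longer match earns a larger speculation budget.
        budget = alpha * p
        draft = []
        while node["children"] and len(draft) < budget:
            # Greedily follow the most frequently seen continuation.
            tok, node = max(
                node["children"].items(), key=lambda kv: kv[1]["count"]
            )
            draft.append(tok)
        return draft


# Usage: cache earlier outputs, then draft tokens for a new context.
trie = SuffixTrie(max_depth=8)
trie.add([1, 2, 3, 4, 5, 6])
trie.add([1, 2, 3, 4, 5, 6])
trie.add([9, 2, 3, 7])
long_match = trie.speculate([0, 1, 2, 3])   # matches the pattern [1, 2, 3]
short_match = trie.speculate([3])           # matches only [3]
no_match = trie.speculate([42])             # nothing cached for this token
```

In this sketch a three-token match yields a longer draft than a one-token match, and an unmatched context yields no draft at all, which mirrors the idea of speculating more when acceptance is likely and less when it is not.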
