SAM Decoding: Speculative Decoding via Suffix Automaton

Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, Jing Zhang

TL;DR

This work addresses latency in autoregressive LLM inference by introducing SAM-Decoding, a retrieval-based speculative decoding (SD) method that uses two suffix automata (a static one built from a text corpus and a dynamic one built from the current sequence) to locate the exact longest suffix match. Suffix updates and draft retrieval take amortized $O(1)$ time per generation step, and the method remains compatible with existing SD strategies. Empirically, SAM-Decoding is $18\%+$ faster than prior retrieval-based SD baselines on Spec-Bench and yields up to $11.13\%$ additional speedup when combined with EAGLE-2, with consistent gains across multiple backbone models and tasks. The approach broadens the applicability of speculative decoding beyond domain-specific tasks and offers a practical path to faster, scalable text generation, with code available at the project repository.

Abstract

Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedups, but they often rely on incomplete retrieval resources and inefficient retrieval methods, and are constrained to certain domains. This paper presents a novel retrieval-based speculative decoding method that adapts a suffix automaton (SAM) for efficient and accurate draft generation by utilizing a common text corpus and the dynamic text sequence. Unlike existing $n$-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of $O(1)$ per generation step for SAM update and suffix retrieval. It can also integrate with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive experiments on Spec-Bench show that our method is $18\%+$ faster than other retrieval-based SD methods. Additionally, when combined with the advanced EAGLE-2, it provides an additional speedup of $3.28\%$ -- $11.13\%$ across various-sized LLM backbones. Our code is available at https://github.com/hyx1999/SAM-Decoding.
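To make the longest-suffix-match idea concrete, the following is a minimal sketch of a textbook online suffix automaton, not the authors' implementation. The class name `SuffixAutomaton`, the `first_end` bookkeeping (end index of a state's first occurrence), and the helpers `step`, `draft`, and `longest_suffix_match` are assumptions introduced here for illustration. Each `extend` call and each `step` call is amortized constant time for a fixed vocabulary, which is the property the abstract relies on.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


class SuffixAutomaton:
    """Online suffix automaton over a token sequence (illustrative sketch)."""

    def __init__(self) -> None:
        self.next: List[Dict[Hashable, int]] = [{}]  # outgoing transitions per state
        self.link: List[int] = [-1]                  # suffix links (root has none)
        self.length: List[int] = [0]                 # longest substring length per state
        self.first_end: List[int] = [-1]             # assumed: end index of first occurrence
        self.tokens: List[Hashable] = []             # the indexed text itself
        self.last = 0                                # state of the whole text so far

    def _add_state(self, length: int, link: int, first_end: int,
                   trans: Dict[Hashable, int]) -> int:
        self.next.append(trans)
        self.link.append(link)
        self.length.append(length)
        self.first_end.append(first_end)
        return len(self.next) - 1

    def extend(self, token: Hashable) -> None:
        """Append one token to the indexed text (amortized O(1))."""
        self.tokens.append(token)
        cur = self._add_state(self.length[self.last] + 1, 0,
                              len(self.tokens) - 1, {})
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p != -1:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter substrings of q's class.
                clone = self._add_state(self.length[p] + 1, self.link[q],
                                        self.first_end[q], dict(self.next[q]))
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def step(self, state: int, match_len: int,
             token: Hashable) -> Tuple[int, int]:
        """Advance the match by one token, falling back along suffix links."""
        while state != 0 and token not in self.next[state]:
            state = self.link[state]
            match_len = self.length[state]
        if token in self.next[state]:
            return self.next[state][token], match_len + 1
        return 0, 0

    def draft(self, state: int, k: int) -> List[Hashable]:
        """Return up to k tokens that follow the first occurrence of the match."""
        if state == 0:
            return []
        start = self.first_end[state] + 1
        return self.tokens[start:start + k]


def longest_suffix_match(sam: SuffixAutomaton,
                         query: Sequence[Hashable]) -> Tuple[int, int]:
    """Feed query tokens in order; return (state, length of longest matched suffix)."""
    state, match_len = 0, 0
    for tok in query:
        state, match_len = sam.step(state, match_len, tok)
    return state, match_len
```

For example, after indexing the string "ABCBC" from Figure 2, calling `longest_suffix_match(sam, "XBC")` matches the suffix "BC" with length 2, and `sam.draft(state, 2)` returns the two tokens that follow the first occurrence of "BC" in the indexed text. In a decoding loop, the `(state, match_len)` pair would be carried across steps so that each newly accepted token costs only one `step` call.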

Paper Structure

This paper contains 19 sections, 11 equations, 9 figures, 9 tables, 4 algorithms.

Figures (9)

  • Figure 1: Throughput of Vicuna-7B, Vicuna-13B, Vicuna-33B on MT-Bench with A6000 GPU using PLD, Token Recycling, EAGLE-2, and SAM-Decoding, where PLD is the SOTA retrieval-based SD baseline.
  • Figure 2: The suffix automaton corresponding to the string "ABCBC".
  • Figure 3: Overview of SAM-Decoding's workflow. In each round of generation, the suffix automaton matches the suffixes of the generating text and retrieves the draft from the text corpus and the generated text respectively according to the matching position. Our method can be combined with an auxiliary SD algorithm (Auxiliary) to deal with the scenarios where the retrieval is not applicable. We select the best draft from the three candidate drafts based on the match length, and then the drafts are verified by the LLM for accepted tokens. Using these accepted tokens, we finally extend the dynamic SAM and generate text for the next round of generation.
  • Figure 4: Relative speedup of SAM-Decoding compared to retrieval-based SD baselines on Spec-Bench.
  • Figure 5: Relative speedup of SAM-Decoding compared to SD baselines on Spec-Bench when combined with auxiliary SD methods.
  • ...and 4 more figures
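
The draft-selection step described in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact rule: the `Candidate` type, the `select_draft` function, and the `min_match_len` threshold (with its default of 4) are hypothetical names and values introduced here. The sketch prefers whichever retrieval draft has the longer suffix match and falls back to the auxiliary SD method when neither match is long enough.

```python
from typing import List, NamedTuple, Sequence


class Candidate(NamedTuple):
    source: str        # "static", "dynamic", or "auxiliary"
    match_len: int     # length of the suffix match that produced the draft
    tokens: List[int]  # proposed draft tokens


def select_draft(static_c: Candidate,
                 dynamic_c: Candidate,
                 auxiliary_tokens: Sequence[int],
                 min_match_len: int = 4) -> Candidate:
    """Pick a draft by match length (hypothetical threshold rule).

    The retrieval draft with the longer match wins; ties go to the static
    candidate. If the best match is shorter than min_match_len, the
    auxiliary draft is used instead.
    """
    best = max((static_c, dynamic_c), key=lambda c: c.match_len)
    if best.match_len >= min_match_len:
        return best
    return Candidate("auxiliary", 0, list(auxiliary_tokens))
```

Whichever draft is selected is then verified by the LLM, and only the accepted tokens are appended to the dynamic automaton, as the caption describes.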