SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

Chengbo Liu, Yong Zhu

TL;DR

This work tackles latency in autoregressive LLM inference with SDSAT, a speculative decoding scheme in which semantic adaptive tokens enable high-quality drafting without modifying the model architecture. It presents a two-step draft-then-verify framework compatible with both greedy search and nucleus sampling, supported by a training scheme that introduces the adaptive tokens with minimal distortion to standard-token predictions. Experiments on CodeLlama-7B and CodeLlama-13B show speedups above 3x across multiple tasks (Python code generation, multilingual coding, and infilling) while maintaining near-original accuracy; the approach works with diverse tokens and both inference strategies, and the larger model benefits more. The practical impact is an acceleration method that can retrofit existing LLMs for faster draft-and-verify decoding, reducing latency in real-time or batch generation without external databases or model redesign.

Abstract

We propose an acceleration scheme for large language models (LLMs) through Speculative Decoding with Semantic Adaptive Tokens (SDSAT). The primary objective of this design is to enhance the LLM's ability to generate draft tokens more accurately without compromising the model's accuracy. The core strategies are: 1) We fine-tune the model by incorporating semantic adaptive tokens that possess flexible decoding capabilities, without changing the model's structure, so that they can generate high-quality draft tokens. 2) By employing a training method that does not affect the standard tokens, the model acquires parallel decoding abilities atop its original framework with minimal training overhead. 3) We design "two-step-draft-then-verify" generation strategies for both greedy search and nucleus sampling. Experiments conducted on the CodeLlama-13B and 7B models yield speed increases of over 3.5X and 3.0X, respectively. For details, see https://github.com/hasuoshenyun/SDSAT.
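
The verification step of a draft-then-verify loop under greedy search can be illustrated with a minimal sketch. This is generic speculative-decoding logic, not the authors' implementation: the function name verify_greedy, its arguments, and the convention of also emitting the verifier's own token at the first mismatch are all assumptions made for illustration.

```python
from typing import List, Tuple


def verify_greedy(draft: List[int], verified: List[int]) -> Tuple[List[int], int]:
    """Keep draft tokens up to the first mismatch with the verifier's greedy picks.

    draft:    token ids proposed during drafting (hypothetical input).
    verified: the model's greedy choice at each draft position, plus one extra
              choice for the position after the last draft token, so
              len(verified) == len(draft) + 1.
    Returns (tokens emitted in this loop, number of draft tokens accepted).
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must hold exactly one more token than draft")

    n_accepted = 0
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            break  # first mismatch: reject this and all later draft tokens
        n_accepted += 1

    # Assumption: the verifier's own token at the first mismatch (or after a
    # fully accepted draft) is emitted as well, as in generic speculative decoding.
    emitted = draft[:n_accepted] + [verified[n_accepted]]
    return emitted, n_accepted
```

For example, verify_greedy([5, 6, 7], [5, 6, 9, 4]) accepts the first two draft tokens and returns ([5, 6, 9], 2), while a fully matching draft such as verify_greedy([1, 2], [1, 2, 3]) returns ([1, 2, 3], 2).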

Paper Structure

This paper contains 19 sections, 9 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Diagram of the inference mechanisms. The left panel depicts the greedy search process, which is a special case of the nucleus sampling process shown in the right panel. A green checkmark indicates that a token is accepted, and a cross indicates that it is rejected.
  • Figure 2: An example of the "two-step-draft-then-verify" process using the greedy search generation strategy, with [32011], [32012], and [32013] as the adaptive tokens selected for the CodeLlama model. Each loop consists of three steps: two drafting steps followed by a verification step. In the diagram, the tokens "by" and "fostering" in loop 1 do not match, so the accepted tokens for loop 1 end at "by". The second loop passes all verifications, so the results generated by all adaptive tokens are accepted.
  • Figure 3: Performance of SDSAT-7B (L=5). Left: accept rate, defined as the average number of accepted tokens divided by the number of adaptive tokens. Right: tokens per second for newly generated tokens.
  • Figure 4: Performance of SDSAT-13B (L=7). Left: accept rate. Right: tokens per second for newly generated tokens.
  • Figure 5: Loss curves of standard tokens under two different training methods. The standard-token loss excludes the loss associated with adaptive tokens and is averaged over all standard tokens.
  • ...and 2 more figures
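
The accept rate defined in the Figure 3 caption (average number of accepted tokens divided by the number of adaptive tokens) can be computed as in the short sketch below. The function name accept_rate and the choice to average over decoding loops are assumptions for illustration; the paper's exact averaging procedure is not given in this summary.

```python
from typing import Sequence


def accept_rate(accepted_per_loop: Sequence[int], num_adaptive_tokens: int) -> float:
    """Average accepted tokens per loop, divided by the number of adaptive tokens.

    accepted_per_loop:   accepted-token count for each decoding loop
                         (hypothetical measurement, one entry per loop).
    num_adaptive_tokens: number of adaptive tokens used per loop (L in the captions).
    """
    if not accepted_per_loop:
        raise ValueError("need at least one loop")
    if num_adaptive_tokens <= 0:
        raise ValueError("num_adaptive_tokens must be positive")

    # Assumption: "average" means the mean over decoding loops.
    mean_accepted = sum(accepted_per_loop) / len(accepted_per_loop)
    return mean_accepted / num_adaptive_tokens
```

With L=5 and made-up per-loop counts of 2, 4, and 3 accepted tokens, the mean is 3 accepted tokens per loop, giving an accept rate of 0.6.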