EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang

TL;DR

This work proposes a novel method that handles the varying numbers of tokens accepted by different samples in a batch without increasing memory or computing overhead, and that handles differing numbers of prediction tokens across samples without adding padding tokens.

Abstract

Speculative decoding has emerged as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked because the number of accepted tokens varies across the samples of a batch in the verification phase. The vanilla method adds padding tokens to ensure that the number of new tokens remains consistent across samples. However, this increases the computational and memory access overhead, thereby reducing the speedup ratio. We propose a novel method that resolves the issue of different samples accepting different numbers of tokens without increasing memory or computing overhead. Furthermore, our proposed method handles the situation where the number of prediction tokens differs between samples without adding padding tokens. Extensive experiments demonstrate the efficacy of our method. Our code is available at https://github.com/niyunsheng/EMS-SD.
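To make the padding cost of the vanilla method concrete, the sketch below computes what fraction of a padded batch is wasted. It assumes one plausible definition of the padding ratio (padding slots divided by all slots after every sample is padded to the batch maximum); the function name `padding_ratio` and the example counts are hypothetical and not taken from the paper or its code.

```python
def padding_ratio(new_tokens_per_sample):
    """Fraction of token slots that are padding when every sample is
    padded to the largest number of new tokens in the batch.
    (Assumed definition, for illustration only.)"""
    if not new_tokens_per_sample:
        return 0.0
    longest = max(new_tokens_per_sample)
    total_slots = longest * len(new_tokens_per_sample)
    if total_slots == 0:
        return 0.0
    useful = sum(new_tokens_per_sample)
    return (total_slots - useful) / total_slots


# Hypothetical batch: four samples accepted 5, 2, 1 and 4 tokens.
accepted = [5, 2, 1, 4]
ratio = padding_ratio(accepted)  # 8 padding slots out of 20
```

When all samples accept the same number of tokens the ratio is zero, so the overhead comes entirely from the spread within a batch.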

Paper Structure (15 sections, 12 equations, 7 figures, 6 tables)

This paper contains 15 sections, 12 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Speedup ratio of OPT-6.7B on the CNN/Daily Mail dataset under greedy settings for batch sizes $\ge 1$, using LLMA (Yang et al., 2023) as the base speculative decoding method. Our method outperforms the vanilla method at every batch size, and its advantage grows with the batch size.
  • Figure 2: Our Method vs. Vanilla Method. We specify the location of the KV cache for each sample individually, which eliminates the need to pad the KV cache. When the number of prediction tokens differs between samples, we concatenate the input tokens of all samples into a single sequence without padding tokens. Our method outperforms the vanilla method without incurring additional computational or memory access overhead.
  • Figure 3: The detailed processing of the unpadded input tokens at decoding step 1 in Figure 2. Sample 0 predicted 5 tokens, while sample 1 predicted 2 tokens. All tokens are concatenated before inference, and the sample/sequence index is restored when attention is computed within the CUDA kernels. Consequently, each token knows which KV caches it can use for parallel computation.
  • Figure 4: The average padding ratio when using the vanilla multi-sample method. The average padding ratio represents the amount of redundant computation, and an increase in this ratio results in a proportional reduction in speedup.
  • Figure 5: The inference time of different numbers of input tokens per sample under different batch sizes. We set the number of existing tokens in each sample to 512. When the number of input tokens per sample is varied with a batch size of 4, the inference time remains essentially unchanged. However, when the batch size is increased to 16, the inference time changes significantly.
  • ...and 2 more figures
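
The unpadded concatenation described in the captions for Figures 2 and 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' CUDA implementation: the function name `unpad_inputs`, the token ids, and the cache lengths are all hypothetical. It shows how the tokens of all samples can be placed in one sequence while keeping enough bookkeeping (a sample index and a KV-cache position per token) for an attention kernel to recover which sample each token belongs to.

```python
from itertools import accumulate


def unpad_inputs(draft_tokens, cache_lens):
    """Concatenate the input tokens of all samples without padding.

    draft_tokens: list of per-sample token lists (lengths may differ).
    cache_lens:   number of tokens already in each sample's KV cache.
    Returns the flat token list, the sample index of each token, the
    KV-cache position of each token, and cumulative sequence lengths.
    """
    flat_tokens, sample_idx, positions = [], [], []
    for i, (tokens, cached) in enumerate(zip(draft_tokens, cache_lens)):
        for offset, tok in enumerate(tokens):
            flat_tokens.append(tok)
            sample_idx.append(i)               # which sample owns this token
            positions.append(cached + offset)  # slot in that sample's KV cache
    # cu_seqlens[i]:cu_seqlens[i + 1] is the slice belonging to sample i.
    cu_seqlens = [0] + list(accumulate(len(t) for t in draft_tokens))
    return flat_tokens, sample_idx, positions, cu_seqlens


# Mirrors the Figure 3 setting: sample 0 has 5 tokens, sample 1 has 2.
# Token ids and cache lengths are made up for the example.
tokens, owner, pos, cu = unpad_inputs(
    [[101, 102, 103, 104, 105], [201, 202]],
    cache_lens=[10, 7],
)
```

Here the flat sequence holds 7 tokens, whereas padding both samples to the longer length would need 10 slots.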