BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang

TL;DR

BanditSpec is a training-free online learning framework that casts adaptive speculative decoding as a multi-armed bandit problem, so that hyperparameter configurations can be selected dynamically while text is generated to accelerate LLM inference. It provides two algorithms, UCBSpec for stochastic rewards and EXP3Spec for adversarial rewards, with upper bounds on the stopping time regret in both settings; in the stochastic case, an information-theoretic lower bound shows that UCBSpec is optimal up to universal constants. Empirical results on LLaMA3 and Qwen2 show that the adaptive methods improve throughput and latency across prompts and hardware, approaching oracle-best performance without retraining. The work offers a practical, deployable approach to accelerating inference under realistic serving conditions and motivates future extensions to structured, robust, and contextual bandits.

Abstract

Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
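To make the bandit formulation concrete, the following is a minimal sketch of how a UCB-style rule could pick among candidate configurations during generation. It uses the textbook UCB1 index, not the paper's UCBSpec, whose exact index and reward definition are not given in this section. The names `UCB1Selector` and `simulate`, the per-arm acceptance probabilities, and the normalized reward (tokens produced per round divided by the maximum possible) are all assumptions made for illustration.

```python
import math
import random


class UCB1Selector:
    """Textbook UCB1 over a finite set of candidate configurations (arms)."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms   # number of times each arm was played
        self.means = [0.0] * num_arms  # running mean reward of each arm
        self.total = 0                 # total number of rounds played

    def select(self):
        # Play every arm once before relying on the confidence bonus.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        # Pick the arm with the largest mean plus exploration bonus.
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(accept_probs, max_draft, target_len, seed=0):
    """Toy decoding loop (hypothetical environment, not a real LLM).

    Each arm is a draft configuration whose drafted tokens are accepted
    independently with probability accept_probs[arm]. A round drafts up to
    max_draft tokens, keeps the accepted prefix, and always adds one token
    from the target model. Returns the number of rounds needed to produce
    target_len tokens and the per-arm play counts.
    """
    rng = random.Random(seed)
    selector = UCB1Selector(len(accept_probs))
    generated, rounds = 0, 0
    while generated < target_len:
        arm = selector.select()
        accepted = 0
        while accepted < max_draft and rng.random() < accept_probs[arm]:
            accepted += 1
        tokens = accepted + 1
        # Assumed reward: tokens produced this round, scaled into [0, 1].
        selector.update(arm, tokens / (max_draft + 1))
        generated += tokens
        rounds += 1
    return rounds, selector.counts
```

Calling `simulate([0.2, 0.5, 0.9], max_draft=4, target_len=20000)` returns the number of rounds used and how often each arm was played. In this toy setting the round count plays the role of the stopping time: a selector that concentrates on the configuration with the highest acceptance rate finishes the sequence in fewer rounds.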

Paper Structure

This paper contains 36 sections, 23 theorems, 106 equations, 3 figures, 4 tables, 11 algorithms.

Key Result

Proposition 3.1

For any arm selection algorithm $\mathtt{ALG}$ that selects an arm according to the history, the generated prompt $\mathrm{pt}_{\mathrm{ST}(\mathtt{ALG})}$ is equal to $\mathrm{pt}_{\tau_{\mathrm{c}}}$ in distribution, i.e., … The stopping time $\mathrm{ST}(\mathtt{ALG})$ can be bounded as …

Figures (3)

  • Figure 1: Given the prefix tokens and the candidate hyperparameter configurations (e.g., models), which configuration should be selected to decode the next tokens? We formulate this problem as a bandit problem and propose a general framework BanditSpec.
  • Figure 2: Illustration of our bandit model for choosing configurations to decode the next token, where UCB and EXP3 refer to UCBSpec and EXP3Spec, respectively.
  • Figure 3: We compare throughput improvements with different speculative decoding lengths $\gamma\in[4]$ and canonical decoding ($\gamma=0$). The performance of UCBSpec approaches that of the best hyperparameter across all samples for both target models, LLaMA3 and Qwen2. The sample indices are sorted by the improvement of the best arm for clarity.

Theorems & Definitions (34)

  • Proposition 3.1
  • Theorem 4.3: Upper Bound
  • Theorem 4.4: Lower Bound
  • Proposition 4.4: Tightness Result
  • Theorem 5.3
  • Remark 6.1
  • Theorem 4.1: Upper Bound
  • Proof of Theorem \ref{thm:stoc_up}
  • Theorem 4.1
  • Proof of Theorem \ref{thm:adv_UpBd}
  • ...and 24 more