
TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding

Aditya Sridhar, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa

TL;DR

TapOut is an online, tuning-free framework for dynamic speculative decoding that treats the decision of when to stop drafting as a multi-armed bandit problem over a pool of training-free stopping strategies. Using sequence-level UCB1 (and other bandit algorithms) with a blended reward that balances accepted length and acceptance rate, TapOut achieves speedups competitive with or superior to tuning-heavy baselines across diverse model families and prompts, while remaining interpretable through its online arm values. The method reduces the need for hand-tuned thresholds and adapts to shifts in the prompt distribution, enabling robust throughput improvements in speculative decoding. Limitations include reliance on the quality of the chosen arms and evaluation on relatively small datasets, which suggests broader testing and extensions to contextual bandits or richer reinforcement learning approaches.

Abstract

Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
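
The selection step described above, choosing among stopping strategies based on past reward and exploration, can be illustrated with a minimal UCB1 sketch. This is not the authors' implementation. The class name `UCB1`, the function `blended_reward`, the normalization by `max_draft`, and the mixing weight `alpha` are all assumptions made for illustration; this summary does not give the exact form of the blended reward, only that it balances accepted length and rate.

```python
import math


class UCB1:
    """Standard UCB1 over a fixed pool of arms (here: stopping strategies)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms      # number of times each arm was played
        self.values = [0.0] * n_arms    # running mean reward per arm
        self.total = 0                  # total number of plays

    def select(self):
        # Play every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        scores = [
            mu + math.sqrt(2.0 * math.log(self.total) / n)
            for mu, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        # Incremental update of the arm's mean reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def blended_reward(accepted, drafted, max_draft, alpha=0.5):
    """Hypothetical blend of normalized accepted length and acceptance rate.

    The convex combination and the weight `alpha` are assumptions, not the
    formula used in the paper.
    """
    if drafted == 0:
        return 0.0
    length_term = accepted / max_draft   # how much was accepted, in [0, 1]
    rate_term = accepted / drafted       # fraction of drafted tokens accepted
    return alpha * length_term + (1.0 - alpha) * rate_term
```

In use, each arm index would map to one stopping strategy: call `select()` to pick a strategy, run drafting and verification with it, then pass the resulting reward to `update()`. The per-arm entries of `values` are the kind of online arm values that make the policy inspectable.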

Paper Structure

This paper contains 20 sections, 3 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Dynamic speculative decoding with TapOut. After the draft model $M_q$ generates predictions for the next token, TapOut $\pi_\theta$ uses a bandit algorithm to select a dynamic speculation algorithm $f(\cdot)$ which returns a decision to stop or continue drafting. If $f(\cdot)$ chooses to stop drafting, the target model $M_p$ performs verification of all generated tokens and selects a subset that match its predictions. The accepted tokens are used to update the bandit policy.
  • Figure 2: Draft model $\sqrt{\mathcal{H}(p(x_t \mid x_{<t}))}$ by position $t$ for accepted tokens in responses to coding and non-coding prompts.
  • Figure 3: Comparison of speculated length across reward types. The accepted length reward ($r^{simple}$) causes the agent to aggressively speculate while the blended reward ($r^{blend}$) acts more conservatively.
  • Figure 4: Comparison of speedup between UCB1 and UCB-Tuned. UCB1 provides better performance across all prompt categories.
  • Figure 5: Progression of TapOut Sequence-level UCB1 $\mu_i$ for Llama-3 1B/8B on a) MT-Bench and b) HumanEval.
  • ...and 1 more figure
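
The quantity plotted in Figure 2, $\sqrt{\mathcal{H}(p(x_t \mid x_{<t}))}$, is the square root of the Shannon entropy of the draft model's next-token distribution. A minimal sketch of computing it from a probability vector follows. The function name `sqrt_entropy` is hypothetical, and the use of natural logarithms (nats) is an assumption, since the caption does not state the base.

```python
import math


def sqrt_entropy(probs):
    """Square root of the Shannon entropy (in nats) of a distribution.

    `probs` is a sequence of probabilities summing to 1; zero entries
    contribute nothing to the entropy and are skipped.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(entropy, 0.0))
```

A peaked distribution gives a value near zero, while a flat one gives a larger value, which is why entropy-based signals are a natural candidate for deciding when the draft model has become uncertain enough to stop drafting.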