TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
Aditya Sridhar, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa
TL;DR
TapOut is an online, tuning-free framework for dynamic speculative decoding that treats the decision of when to stop drafting as a multi-armed bandit problem over a pool of training-free stopping strategies. Using sequence-level UCB1 (and other bandit algorithms) with a blended reward that balances accepted length and acceptance rate, TapOut achieves speedups competitive with or superior to those of tuning-heavy baselines across diverse model families and prompts, while remaining interpretable through its online arm values. The method reduces the need for hand-tuned thresholds and adapts to shifts in the prompt distribution, enabling robust throughput improvements in speculative decoding. Limitations include reliance on the quality of the chosen arms and evaluation on relatively small datasets, which suggests opportunities for broader testing and for extensions to contextual bandits or richer reinforcement learning approaches.
Abstract
Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
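To make the bandit framing concrete, the following is a minimal sketch of textbook UCB1 choosing among stopping strategies, and not the authors' implementation. Everything specific in it is an assumption made for illustration: the arm names, the simulated draft lengths and acceptance probabilities, the `blended_reward` formula with its `alpha` weight, and the choice to pick one arm per simulated drafting round. The section above does not specify the actual arms or the exact form of the reward.

```python
import math
import random


class UCB1:
    """Textbook UCB1 over a fixed set of arms, assuming rewards in [0, 1]."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}  # running mean reward per arm

    def select(self):
        # Play every arm once before relying on the confidence bound.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.values[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental mean update for the chosen arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def blended_reward(num_accepted, num_drafted, max_draft, alpha=0.5):
    """Hypothetical blend of normalized accepted length and acceptance rate."""
    if num_drafted == 0:
        return 0.0
    length_term = num_accepted / max_draft
    rate_term = num_accepted / num_drafted
    return alpha * length_term + (1.0 - alpha) * rate_term


# Hypothetical arms: (tokens drafted per round, per-token acceptance probability).
# Real arms would be parameter-free stopping rules, not fixed lengths.
SIMULATED_ARMS = {
    "short_draft": (2, 0.9),
    "medium_draft": (4, 0.8),
    "long_draft": (8, 0.5),
}


def simulate_round(arm, rng):
    """Simulate verification: tokens are accepted until the first rejection."""
    drafted, p_accept = SIMULATED_ARMS[arm]
    accepted = 0
    for _ in range(drafted):
        if rng.random() >= p_accept:
            break
        accepted += 1
    return accepted, drafted


def run(num_rounds=2000, seed=0):
    rng = random.Random(seed)
    bandit = UCB1(SIMULATED_ARMS)
    for _ in range(num_rounds):
        arm = bandit.select()
        accepted, drafted = simulate_round(arm, rng)
        bandit.update(arm, blended_reward(accepted, drafted, max_draft=8))
    return bandit


if __name__ == "__main__":
    bandit = run()
    for arm in bandit.arms:
        print(arm, bandit.counts[arm], round(bandit.values[arm], 3))
```

The per-arm running means in `values` are the kind of online arm values that the TL;DR describes as a source of interpretability: they can be inspected at any point to see which strategy the bandit currently favors. In a real system, `simulate_round` would be replaced by an actual draft-and-verify step.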
