Speculative Sampling via Exponential Races
Szymon Kobus, Deniz Gündüz
TL;DR
The paper studies how to accelerate autoregressive text generation by linking speculative decoding to channel simulation, and introduces Exponential Race Speculative Decoding (ERSD). It shows that the speed-up from drafting is governed by the entropy of the acceptance distribution $H[R]$, with a lower bound via $D_{KL}(P \| Q)$ and an upper bound arising from channel-simulation schemes, yielding $\mathbb{E}[\#\text{generated}] \le (\log|\Omega| + \log(k+1)) / H[R]$. It develops an optimal drafting-tree algorithm with $O(k\log k)$ complexity and connects the drafting process to Tunstall coding, which gives an asymptotic interpretation of the speed-up in terms of source compression. Empirically, ERSD matches state-of-the-art speculative decoding performance, and the experiments show that higher-order dependencies can influence practical outcomes, pointing to room for further improvement.
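To get a feel for the scale of the stated bound, the sketch below evaluates $(\log|\Omega| + \log(k+1)) / H[R]$ numerically. It assumes that $|\Omega|$ is the vocabulary size and that both logarithms and $H[R]$ are measured in bits; the function name and the example numbers are hypothetical illustrations, not values from the paper.

```python
import math


def generated_tokens_upper_bound(vocab_size: int, k: int, entropy_bits: float) -> float:
    """Evaluate (log|Omega| + log(k+1)) / H[R] with all quantities in bits.

    Assumptions (not from the paper): |Omega| is the vocabulary size and
    H[R] is supplied directly as a positive number of bits.
    """
    if vocab_size < 1 or k < 0:
        raise ValueError("vocab_size must be >= 1 and k must be >= 0")
    if entropy_bits <= 0:
        raise ValueError("entropy_bits must be positive")
    return (math.log2(vocab_size) + math.log2(k + 1)) / entropy_bits


# Hypothetical numbers: 32,000-token vocabulary, k = 15 drafted tokens, H[R] = 4 bits.
example_bound = generated_tokens_upper_bound(32_000, 15, 4.0)
print(f"bound on expected generated tokens: {example_bound:.2f}")
```

Because $k$ enters only through $\log(k+1)$, doubling the number of drafted tokens raises this bound by roughly $1 / H[R]$ when entropy is in bits.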
Abstract
Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims to simulate a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that speculative decoding can achieve. Leveraging this link, we derive an explicit relation between the generation speed-up and the number of tokens $k$ generated by the draft model for large $k$, which also serves as an upper bound for all $k$. We also propose a novel speculative decoding method based on exponential races, ERSD, that matches state-of-the-art performance.
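The abstract does not spell out the ERSD procedure, so the following is only a minimal sketch of the general exponential-race idea, not the authors' algorithm. It relies on a standard fact: if $E_i \sim \mathrm{Exp}(1)$ are independent, then $\arg\min_i E_i / p_i$ is distributed according to $p$. The sketch additionally assumes that the draft and target races share the same $E_i$, so their winners often coincide; the function names and the toy distributions are hypothetical.

```python
import random


def exponential_race(probs, arrivals):
    """Return the index i that minimises arrivals[i] / probs[i]."""
    best, best_time = None, float("inf")
    for i, (p, e) in enumerate(zip(probs, arrivals)):
        if p > 0 and e / p < best_time:
            best, best_time = i, e / p
    return best


def coupled_draw(p_target, q_draft, rng):
    """Run the draft and target races on one shared set of Exp(1) arrivals."""
    arrivals = [rng.expovariate(1.0) for _ in p_target]
    return exponential_race(q_draft, arrivals), exponential_race(p_target, arrivals)


rng = random.Random(0)
p = [0.5, 0.3, 0.2]  # toy target distribution (hypothetical)
q = [0.4, 0.4, 0.2]  # toy draft distribution (hypothetical)
n = 20_000
draws = [coupled_draw(p, q, rng) for _ in range(n)]

# Empirical distribution of the target winner and how often the two races agree.
target_freq = [sum(1 for _, t in draws if t == i) / n for i in range(len(p))]
match_rate = sum(1 for d, t in draws if d == t) / n
print(target_freq, match_rate)
```

In this toy setting the target winner follows $p$ regardless of the draft distribution, while the agreement rate depends on how close $q$ is to $p$; a drafted token can be kept whenever the two winners agree.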
