List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression
Joseph Rowan, Buu Phan, Ashish Khisti
TL;DR
This work introduces Gumbel-max List Sampling (GLS), a simple coordinated-sampling framework that enables multiple proposals from a draft distribution to be coupled with a target distribution without communication. By leveraging shared exponential random variables, GLS yields valid marginals and a provable lower bound (List Matching Lemma) on the probability that at least one draft matches the target sample, with the bound strengthening as the number of drafts $K$ grows. The authors instantiate GLS in two applications: (i) drafter-invariant multi-draft speculative decoding for large language models, producing competitive speedups and stronger invariance properties compared to existing schemes, and (ii) distributed lossy compression with independent side information at decoders, achieving notable rate-distortion gains on Gaussian sources and MNIST. The framework also yields theoretical bounds on token-level acceptance and reconstruction error, and is extensible via importance sampling to continuous spaces. Overall, GLS provides a principled, scalable approach to coordinated sampling with practical impact on inference acceleration and distributed compression, with code released for reproducibility.
Abstract
We study a relaxation of the problem of coupling probability distributions: a list of samples is generated from one distribution, and an accept is declared if any one of these samples is identical to the sample generated from the other distribution. We propose a novel method for generating samples, which extends the Gumbel-max sampling suggested in Daliri et al. (arXiv:2408.07978) for coupling probability distributions. We also establish a corresponding lower bound on the acceptance probability, which we call the list matching lemma. We next discuss two applications of our setup. First, we develop a new mechanism for multi-draft speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr and SpecInfer across a range of language tasks. Our method also guarantees a certain degree of drafter invariance with respect to the output tokens, which is not supported by existing schemes. We also provide a theoretical lower bound on the token-level acceptance probability. As our second application, we consider distributed lossy compression with side information in a setting where a source sample is compressed and made available to multiple decoders, each with independent side information. We propose a compression technique that is based on our generalization of Gumbel-max sampling and show that it provides significant gains in experiments involving synthetic Gaussian sources and the MNIST image dataset.
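To make the list-coupling idea concrete, here is a minimal Python sketch of one way a target sample and $K$ draft samples can be drawn from shared exponential random variables. The exact construction is an assumption: the text above does not spell out the sampler, so this is only one plausible instantiation consistent with the description (shared exponentials, valid marginals, match probability growing with $K$). The names `gls_sample` and `match_rate` are hypothetical, and both distributions are assumed to be strictly positive over a finite alphabet.

```python
import numpy as np

def gls_sample(p, q, K, rng):
    """Draw one target sample (~p) and K draft samples (~q) from shared noise.

    Assumed construction, not confirmed by the source text.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Shared randomness: one Exp(1) variable per (draft, symbol) pair.
    E = rng.exponential(size=(K, p.size))
    # Draft k runs an exponential race over its own row: argmin_i E[k, i] / q[i].
    # This is the Gumbel-max trick in exponential form, so each draft follows q.
    drafts = np.argmin(E / q, axis=1)
    # The target races over the column-wise minimum. min_k E[k, i] is Exp(K),
    # so argmin_i min_k E[k, i] / p[i] follows p.
    target = int(np.argmin(E.min(axis=0) / p))
    return target, drafts

def match_rate(p, q, K, trials=20000, seed=0):
    """Monte Carlo estimate of P(at least one draft equals the target sample)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        target, drafts = gls_sample(p, q, K, rng)
        hits += int(np.any(drafts == target))
    return hits / trials
```

Calling `match_rate([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], K=1)` and then again with `K=4` gives an empirical view of how the list-level acceptance probability changes with the number of drafts. Because both sides only need the same array `E`, for example via a shared seed, no communication is required between the party sampling from `q` and the party sampling from `p`.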
