List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression

Joseph Rowan, Buu Phan, Ashish Khisti

TL;DR

This work introduces Gumbel-max List Sampling (GLS), a simple coordinated-sampling framework that enables multiple proposals from a draft distribution to be coupled with a target distribution without communication. By leveraging shared exponential random variables, GLS yields valid marginals and a provable lower bound (List Matching Lemma) on the probability that at least one draft matches the target sample, with the bound strengthening as the number of drafts $K$ grows. The authors instantiate GLS in two applications: (i) drafter-invariant multi-draft speculative decoding for large language models, producing competitive speedups and stronger invariance properties compared to existing schemes, and (ii) distributed lossy compression with independent side information at decoders, achieving notable rate-distortion gains on Gaussian sources and MNIST. The framework also yields theoretical bounds on token-level acceptance and reconstruction error, and is extensible via importance sampling to continuous spaces. Overall, GLS provides a principled, scalable approach to coordinated sampling with practical impact on inference acceleration and distributed compression, with code released for reproducibility.

Abstract

We study a relaxation of the problem of coupling probability distributions: a list of samples is generated from one distribution, and an accept is declared if any one of these samples is identical to the sample generated from the other distribution. We propose a novel method for generating samples, which extends the Gumbel-max sampling suggested in Daliri et al. (arXiv:2408.07978) for coupling probability distributions. We also establish a corresponding lower bound on the acceptance probability, which we call the list matching lemma. We next discuss two applications of our setup. First, we develop a new mechanism for multi-draft speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr and SpecInfer across a range of language tasks. Our method also guarantees a certain degree of drafter invariance with respect to the output tokens, which existing schemes do not support. We also provide a theoretical lower bound on the token-level acceptance probability. As our second application, we consider distributed lossy compression with side information in a setting where a source sample is compressed and made available to multiple decoders, each with independent side information. We propose a compression technique based on our generalization of Gumbel-max sampling and show that it provides significant gains in experiments involving synthetic Gaussian sources and the MNIST image dataset.
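To make the list-level coupling idea concrete, the following is a minimal sketch of one way to couple K draft samples with a single target sample using shared exponential random variables. It is an illustration under stated assumptions, not necessarily the authors' exact procedure: the names `gls_sample` and `empirical_match_rate` are hypothetical, and the specific construction (each draft takes the argmin of its own row of exponentials scaled by the draft distribution, while the target takes the argmin of the per-symbol minimum across rows scaled by the target distribution) is assumed here. It relies on two standard facts: the argmin of i.i.d. Exp(1) variables divided by probabilities p is distributed according to p, and the minimum of K i.i.d. Exp(1) variables is again exponential, so both marginals remain valid.

```python
import numpy as np


def gls_sample(p, q, K, rng):
    """Draw K draft samples (marginal q) and one target sample (marginal p).

    Illustrative construction only; assumes all entries of p and q are positive.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Shared randomness: a K x N table of i.i.d. Exp(1) variables.
    S = rng.exponential(size=(K, p.size))
    # Draft k runs an exponential race over row k, scaled by q.
    drafts = np.argmin(S / q, axis=1)
    # The target races over the per-symbol minimum across the K rows, scaled by p.
    target = int(np.argmin(S.min(axis=0) / p))
    return drafts, target


def empirical_match_rate(p, q, K, trials, rng):
    """Fraction of trials in which the target sample appears in the draft list."""
    hits = 0
    for _ in range(trials):
        drafts, target = gls_sample(p, q, K, rng)
        hits += int(target in drafts)
    return hits / trials


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = [0.5, 0.3, 0.2]  # target distribution (example values)
    q = [0.2, 0.3, 0.5]  # draft distribution (example values)
    for K in (1, 2, 4, 8):
        print(K, empirical_match_rate(p, q, K, trials=5000, rng=rng))
```

Neither side needs to see the other's distribution: the drafter uses only q and the shared table, and the target uses only p and the same table, which is what makes the coupling communication-free in this sketch.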

Paper Structure

This paper contains 52 sections, 14 theorems, 118 equations, 6 figures, 14 tables, 2 algorithms.

Key Result

Proposition 1

The procedure described above (GLS) generates samples such that:

Figures (6)

  • Figure 1: Problem setting for lossy compression with side information at the decoder.
  • Figure 2: Experiments on a Gaussian source. (a)-(c): Matching probability, from left to right: GLS without side information, GLS with side information, baseline with side information. The baseline does not benefit from multiple decoders without side information. (d): Rate-distortion curves for GLS and baseline (BL) schemes.
  • Figure 3: Examples showing success and failure modes of our compression scheme on MNIST.
  • Figure 4: Rate-distortion curves on MNIST for GLS and baseline (BL) schemes.
  • Figure 5: Distributions used in the proof of Theorem 1 (list matching lemma) in the case of $K=2$ and $N=2$.
  • ...and 1 more figure

Theorems & Definitions (24)

  • Proposition 1
  • Theorem 1: List matching lemma
  • Remark 1
  • Definition 1: Conditional drafter invariance
  • Proposition 2
  • Proposition 3
  • Theorem 2: Conditional LML
  • Proposition 4
  • Proposition \ref{prop:distribution}
  • proof
  • ...and 14 more