The Disparate Impacts of Speculative Decoding

Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto

TL;DR

The paper analyzes speculative decoding in large language models and shows that speed-ups are uneven across tasks and languages because of drafter-verifier misalignment. It formalizes this unfairness with a cross-entropy–based metric and variational bounds, and demonstrates a strong link between drafter task-fitness and acceleration: under-fit or underrepresented tasks are predicted to see smaller speed-ups. The authors propose stochastic corrective drafter finetuning (s-CDF), which mitigates disparities by updating only the drafter, achieving an average 12% improvement in the fairness metric and reduced speed-up variance across model pairs. The work provides theoretical and empirical support for acceleration parity as a practical objective, along with a scalable mitigation method that preserves verifier behavior for fair multilingual deployment.

Abstract

Speculative decoding, in which inference is probabilistically supported by a smaller, cheaper "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, the paper shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented, tasks. To better understand this phenomenon, we derive an analysis that quantifies this observed "unfairness" and draws attention to the factors that cause such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
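
To make the link between drafter fit and speed-up concrete, the sketch below uses the standard speculative sampling analysis rather than the paper's own fairness metric. It assumes the usual accept/reject rule, under which a drafted token is accepted with probability equal to the overlap between the drafter and verifier next-token distributions, and it assumes acceptances are independent with a constant rate. The function names, the toy distributions, and the draft length `gamma` are all hypothetical and chosen only for illustration.

```python
def acceptance_rate(verifier_probs, drafter_probs):
    """Per-token acceptance probability under standard speculative sampling.

    Equals sum_x min(p(x), q(x)), i.e. one minus the total variation
    distance between the verifier (p) and drafter (q) distributions.
    """
    return sum(min(p, q) for p, q in zip(verifier_probs, drafter_probs))


def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verifier pass when gamma tokens are drafted.

    Assumes (as a simplification) that each drafted token is accepted
    independently with the same rate alpha.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Toy next-token distributions over a three-token vocabulary (illustrative only).
verifier = [0.7, 0.2, 0.1]
well_fit_drafter = [0.6, 0.3, 0.1]   # close to the verifier
under_fit_drafter = [0.2, 0.3, 0.5]  # poorly aligned with the verifier

gamma = 4  # hypothetical number of drafted tokens per step
for name, drafter in [("well-fit", well_fit_drafter), ("under-fit", under_fit_drafter)]:
    alpha = acceptance_rate(verifier, drafter)
    tokens = expected_tokens_per_step(alpha, gamma)
    print(f"{name}: alpha={alpha:.2f}, expected tokens per verifier pass={tokens:.2f}")
```

In this toy setting the poorly aligned drafter yields a lower acceptance rate and therefore fewer tokens per verifier pass, which is the kind of task-dependent gap the abstract describes.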

Paper Structure

This paper contains 38 sections, 21 equations, 11 figures, 1 algorithm.

Figures (11)

  • Figure 1: Acceptance rates and accuracies on MGSM using a Qwen2.5-series drafter (0.5B) and verifier (3B).
  • Figure 2: Relation between drafter task-fitness $(1-r_q)$ and task speed-up / acceptance rates $\alpha$.
  • Figure 3: Divergence $\bm{D}(\cdot)$ is high (low) for slower (faster) languages.
  • Figure 4: Acceptance rate $\alpha$ against task accuracy within different languages on MGSM data.
  • Figure 5: Acceptance rates for the Qwen2.5-0.5B/3B model pair on the larger MCoT dataset.
  • ...and 6 more figures

Theorems & Definitions (5)

  • proof
  • proof
  • proof
  • proof
  • proof