The Disparate Impacts of Speculative Decoding
Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto
TL;DR
The paper analyzes speculative decoding in large language models and shows that speed-ups are uneven across tasks and languages due to drafter-verifier misalignment, formalizing this unfairness with a cross-entropy–based metric and variational bounds. It demonstrates a strong link between drafter task-fitness and acceleration, predicting slower speed-ups for under-fit or underrepresented tasks. The authors propose stochastic corrective drafter finetuning (s-CDF) to mitigate disparities by updating only the drafter, achieving an average 12% improvement in the fairness metric and reduced speed-up variance across model pairs. The work provides theoretical and empirical support for acceleration parity as a practical objective, along with a scalable mitigation method that preserves verifier behavior for fair multilingual deployment.
Abstract
Speculative decoding, in which a smaller, cheaper "drafter" model proposes tokens that the larger target model then verifies, has become a standard technique for reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, it shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks, and consistently diminishes for under-fit, often underrepresented, tasks. To better understand this phenomenon, the paper derives an analysis that quantifies the observed "unfairness" and draws attention to the factors that cause such disparate speed-ups to emerge. Guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates it across several model pairs, showing an average 12% improvement in the fairness metric.
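
To make the link between drafter fit and speed-up concrete, the following minimal sketch uses the standard speculative sampling acceptance rule from the general literature, not the fairness metric or analysis of this paper. Under that rule, a drafted token is accepted with probability equal to the overlap between the drafter and verifier distributions. The three-token distributions, the function names, and the assumption that acceptance is independent and identical across positions are all illustrative assumptions, chosen only to show how a poorly fit drafter yields fewer tokens per verifier pass.

```python
def acceptance_rate(p, q):
    """Per-token acceptance probability under standard speculative sampling.

    p: verifier (target) distribution over the vocabulary.
    q: drafter distribution over the same vocabulary.
    The value equals sum_x min(p(x), q(x)), i.e. 1 minus the total
    variation distance between p and q.
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per verifier pass with gamma drafted tokens.

    Assumes (hypothetically) a constant, independent acceptance rate alpha
    at every position; this is a simplification, not a measured quantity.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Illustrative, made-up distributions over a 3-token vocabulary.
verifier = [0.7, 0.2, 0.1]
well_fit_drafter = [0.6, 0.3, 0.1]    # close to the verifier
poorly_fit_drafter = [0.2, 0.3, 0.5]  # e.g. a task the drafter under-fits

for name, drafter in [("well-fit", well_fit_drafter),
                      ("poorly fit", poorly_fit_drafter)]:
    alpha = acceptance_rate(verifier, drafter)
    tokens = expected_tokens_per_step(alpha, gamma=4)
    print(f"{name}: acceptance={alpha:.2f}, tokens/step={tokens:.2f}")
```

In this toy setting the well-fit drafter yields roughly twice as many tokens per verifier pass as the poorly fit one, which is the kind of task-dependent gap in acceleration that the abstract describes.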
