Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios
Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao
TL;DR
This paper tackles the bottleneck of speculative decoding under large-batch inference by introducing SpecFormer, a novel architecture that blends Context Causal Attention with Draft Bi-directional Attention to enable parallel draft-token generation while leveraging full context. It grounds the approach in an arithmetic-intensity framework, defining metrics and constraints (e.g., $AI_m$, $AI_c$, $\rho$, $r_1$, $r_2$, and $\kappa$) to optimize draft efficiency without modifying the base LLM. Through lossless speculative-decoding experiments on 4B–14B models, SpecFormer demonstrates consistent acceleration at constrained draft budgets, aided by engineering improvements like Grouped RMS Norm and intra-batch gradient accumulation. The results show substantial throughput gains in large-batch scenarios and reveal how self-distillation and model size interact with acceleration, underscoring the method’s practicality for scalable, low-cost LLM inference. Overall, SpecFormer advances lossless SD by enabling high-accuracy, non-autoregressive draft generation that preserves correctness while accelerating inference in production-like batched settings.
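To make the arithmetic-intensity argument concrete, the following is a minimal roofline-style sketch for a single dense layer during one decode step. It is not the paper's definition of $AI_m$, $AI_c$, or the other symbols, which are not spelled out in this summary. The function names, the 2-byte weight assumption, and the hardware ratio of 200 FLOPs per byte are all hypothetical, and activation and KV-cache traffic are ignored.

```python
def linear_layer_intensity(batch_size, tokens_per_seq, d_in, d_out, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one dense layer in one decode step.

    Simplifying assumption: weights are read from memory once per step and
    activation / KV-cache traffic is ignored.
    """
    tokens = batch_size * tokens_per_seq
    flops = 2 * tokens * d_in * d_out              # one multiply-add per weight per token
    weight_bytes = bytes_per_param * d_in * d_out  # each weight is loaded once
    return flops / weight_bytes


def spare_tokens_per_seq(batch_size, hardware_flops_per_byte, bytes_per_param=2):
    """Extra (draft) tokens per sequence that fit before the layer turns compute-bound."""
    # intensity = 2 * B * T / bytes_per_param, so T_max = ridge * bytes_per_param / (2 * B)
    max_tokens = hardware_flops_per_byte * bytes_per_param / (2 * batch_size)
    return max(0, int(max_tokens) - 1)  # subtract the one token decoded anyway


if __name__ == "__main__":
    ridge = 200  # hypothetical hardware FLOPs-per-byte ratio
    for b in (1, 64, 256):
        print(b, linear_layer_intensity(b, 1, 4096, 4096), spare_tokens_per_seq(b, ridge))
```

Under these assumptions the intensity grows linearly with the batch size, so the room left for verifying draft tokens shrinks from hundreds per sequence at batch size 1 to a handful, or none, at large batch sizes. This is the regime in which a small, fixed draft budget matters.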
Abstract
Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely adopted in mainstream model inference systems as a superior alternative to speculative decoding, because they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that what is truly needed are more capable models that generate draft sequences in parallel. Recognizing that draft models fundamentally only need to generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
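One way to picture the combination of unidirectional and bidirectional attention is as a single attention mask over the context tokens followed by a short block of draft slots. The sketch below is an illustration under assumptions, not the authors' implementation. The function name `hybrid_attention_mask` and the exact layout (context first, draft slots last, every draft slot seeing the full context and every other draft slot) are hypothetical.

```python
def hybrid_attention_mask(context_len, draft_len):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: positions [0, context_len) are context tokens and
    positions [context_len, context_len + draft_len) are draft slots.
    """
    total = context_len + draft_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < context_len:
                # Context token: causal, sees itself and earlier context only.
                mask[q][k] = k <= q
            elif k < context_len:
                # Draft slot: sees the entire context.
                mask[q][k] = True
            else:
                # Draft slot: bidirectional attention over all draft slots.
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    for row in hybrid_attention_mask(context_len=3, draft_len=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no draft slot depends on the output of another, all draft positions can be filled in one forward pass, while the context rows stay strictly causal.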
