S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Wei Zhong, Manasa Bharadwaj
TL;DR
This paper addresses the memory bottlenecks of speculative decoding (SD) for LLMs on low-memory GPUs by introducing Skippy Simultaneous Speculative Decoding (S3D). S3D combines mid-layer skipping with simultaneous multi-token prediction. It is trained with a mask-token, MLM-style objective, and its draft model shares layers with the target model to avoid extra VRAM costs. The authors formalize the speed-memory trade-off with an acceptance-rate model $\alpha(\beta;U)$ and a speed-up factor $IF(\gamma,\beta)$, and identify optimal hyper-parameters (e.g., symmetric middle-layer skipping and $\gamma \approx 4$). Empirical results show that S3D attains one of the best performance-memory ratios among open-source SD methods while keeping effectiveness close to the baseline. When paired with Phi-3, it decodes 1.4–2× faster than quantized EAGLE on GPUs like the A10G while using less VRAM. This memory efficiency makes fast, cost-effective SD feasible on affordable hardware and shows that self-speculative strategies are practical under memory constraints.
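To make the speed trade-off concrete, here is a minimal sketch of the standard speculative-decoding estimate, not the paper's own $\alpha(\beta;U)$ or $IF(\gamma,\beta)$. It assumes each drafted token is accepted independently with a fixed probability `alpha`, and that drafting costs `draft_cost` target-model forward passes per step. Both function names and the `draft_cost` parameter are hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft-then-verify step.

    Assumes (simplification) that each of the gamma drafted tokens is
    accepted independently with probability alpha; the verifier always
    contributes one extra token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the verifier's own token.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Tokens per unit of compute relative to plain autoregressive decoding.

    draft_cost (hypothetical parameter) is the total drafting cost per step,
    measured in target-model forward passes; verification costs one pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (1.0 + draft_cost)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(estimated_speedup(alpha=0.5, gamma=4, draft_cost=0.5))
```

Under this simplified model, increasing `gamma` has diminishing returns when `alpha` is well below 1, which is one way to see why a moderate draft length can be preferable to a long one.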
Abstract
Speculative decoding (SD) has attracted significant research attention due to the substantial speedup it can achieve for LLM inference. However, despite the high speedups they offer, speculative decoding methods often achieve their best performance only on high-end devices or at the cost of substantial GPU memory overhead. Under limited memory, where quantization becomes necessary, a model that performs well on a high-end GPU can slow down by up to 7 times. To this end, we propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. Compared against recent effective open-source SD systems, our method achieves one of the top performance-memory ratios while requiring minimal architecture changes and training data. Leveraging this memory efficiency, we created a smaller yet more effective SD model based on Phi-3. It is 1.4 to 2 times faster than the quantized EAGLE model and operates in half-precision while using less VRAM.
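As a rough illustration of mid-layer skipping, the sketch below picks a contiguous block of layers centered on the middle of the stack to skip during drafting, so that the draft pass reuses only the remaining target layers. The helper `split_layers` and the reading of `beta` as the fraction of layers skipped are assumptions made for this example; the paper's exact selection rule may differ.

```python
def split_layers(num_layers: int, beta: float) -> tuple[list[int], list[int]]:
    """Return (kept, skipped) layer indices for a draft pass.

    Assumption: beta is the fraction of layers to skip, and the skipped
    block is contiguous and centered on the middle of the stack.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    # Number of skipped layers, rounded to the nearest integer.
    n_skip = round(beta * num_layers)
    # Center the skipped block so equal numbers of layers remain on each side
    # (when an odd number remains, the extra layer goes to the top side).
    start = (num_layers - n_skip) // 2
    skipped = list(range(start, start + n_skip))
    kept = [i for i in range(num_layers) if i < start or i >= start + n_skip]
    return kept, skipped


if __name__ == "__main__":
    kept, skipped = split_layers(num_layers=32, beta=0.5)
    print("draft pass runs layers:", kept)
    print("draft pass skips layers:", skipped)
```

Because the kept layers belong to the target model itself, a draft pass built this way needs no separate draft weights, which is the source of the memory savings the abstract describes.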
