S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs

Wei Zhong, Manasa Bharadwaj

TL;DR

This paper tackles the memory bottlenecks of speculative decoding (SD) for LLMs on low-memory GPUs by introducing Skippy Simultaneous Speculative Decoding (S3D). S3D combines mid-layer skipping with simultaneous multi-token predictions, using a mask-token MLM-style training objective and a draft model that shares layers with the target model to avoid extra VRAM costs. The authors formalize the speed-memory trade-off with an acceptance-rate model $\alpha(\beta;U)$ and a speed-up factor $IF(\gamma,\beta)$, and identify optimal hyper-parameters (e.g., symmetric middle-layer skipping and $\gamma \approx 4$). Empirical results show S3D attains one of the best memory-speed ratios among open-source SD methods, maintains effectiveness close to the baseline, and, when paired with Phi-3, can decode 1.4–2× faster than quantized EAGLE on GPUs like the A10G, while using less VRAM. This memory-efficient approach enables faster, cost-effective SD on affordable hardware and highlights the practical viability of self-speculative strategies under memory constraints.
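
To make the trade-off between acceptance rate and draft length concrete, the sketch below uses the standard simplified model from the speculative decoding literature, not the paper's $\alpha(\beta;U)$ or $IF(\gamma,\beta)$. It assumes each draft token is accepted independently with a fixed probability `alpha`, and that drafting one token costs a fraction `draft_cost` of a full target forward pass. The function names, `draft_cost`, and the sequential-drafting cost term are illustrative assumptions. S3D drafts its $\gamma$ tokens simultaneously, so its actual cost term would differ.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft-verify iteration.

    Assumes (hypothetically) that each of the gamma draft tokens is accepted
    independently with probability alpha, and that verification always
    yields one extra token from the target model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the bonus token.
        return float(gamma + 1)
    # Closed form of the geometric series 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    draft_cost is the assumed cost of drafting one token, relative to one
    full target forward pass. One iteration costs gamma drafts plus one
    verification pass.
    """
    if draft_cost < 0.0:
        raise ValueError("draft_cost must be non-negative")
    return expected_tokens(alpha, gamma) / (gamma * draft_cost + 1.0)


if __name__ == "__main__":
    # Illustrative values only; these are not measurements from the paper.
    for gamma in range(1, 9):
        print(gamma, round(speedup(alpha=0.7, gamma=gamma, draft_cost=0.1), 3))
```

Under this simplified model, the expected token count saturates as `gamma` grows while the drafting cost keeps increasing, which is why an intermediate draft length tends to be optimal.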

Abstract

Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. However, despite the high speedups they offer, speculative decoding methods often achieve optimal performance on high-end devices or with a substantial GPU memory overhead. Given limited memory and the necessity of quantization, a high-performing model on a high-end GPU can slow down by up to 7 times. To this end, we propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. When compared against recent effective open-source SD systems, our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data. Leveraging our memory efficiency, we created a smaller yet more effective SD model based on Phi-3. It is 1.4 to 2 times faster than the quantized EAGLE model and operates in half-precision while using less VRAM.

Paper Structure

This paper contains 17 sections, 11 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Training efficiency, inference efficiency per memory unit, and load-time VRAM evaluated for different models on MT-Bench. From left to right: The most recent open-source SD systems ordered by release dates. All systems use 7B target models with 8-bit quantization. Our model (S3D) stands out in both training efficiency and memory-speed trade-offs.
  • Figure 2: An illustration of S3D based on simultaneous predictions of the last $\gamma$ tokens ($\gamma=2$). A mask token <M> is added to the vocabulary prior to training, and a partial model is trained to predict the next tokens simultaneously. Tree attention is adopted to verify multiple branches of predictions given the top candidates of the $k$-th draft token. Unlike other self-speculative decoding methods based on fully-skipped layers, we only skip the middle layers on top of the draft tokens so that the draft model can access high-level features from top layers as well as the previous states verified by the complete target model.
  • Figure 3: Speed comparison between ours (S3D) and EAGLE on different GPU devices (MT-Bench samples, 7B LLaMA target model). The dashed bars represent the full speed potential of the EAGLE model without memory restrictions. However, when constrained to a VRAM limit of 16 GiB, the quantized EAGLE model (indicated by red bars) suffers from severe speed degradation, highlighting the significant overheads associated with quantization.
  • Figure 4: The overall acceptance rates and individual acceptance rates at different drafting depths (with only a single branch of future tokens). L, LMH, and Emb stand for regular layers, LM heads, and the embedding layer, respectively. Skipping the middle layers symmetrically shows better acceptance rates in general. Note that we distinguish the embedding layer from lm_head here, although in practice they may have tied weights.
  • Figure 5: Upper: The predicted (in dashes) and sampled acceptance rates (interpolated orange dots) of various draft model sizes ($\beta$). Lower: The predicted (in curves) and sampled (in dots) speeds of different draft model sizes and different numbers of guesses ($\gamma$). Our prediction curves justify the optimality of using around half the number of parameters and $\gamma=4$, as observed individually and respectively in Zhang et al. (2023) and Gloeckle et al. (2024).
  • ...and 5 more figures