SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li

TL;DR

This work tackles the efficiency bottleneck of autoregressive LLM inference by introducing SWIFT, a plug-and-play, on-the-fly self-speculative decoding method that adaptively skips intermediate LLM layers to draft tokens, with no additional training or auxiliary modules. SWIFT has two core components. Context-based layer set optimization uses LLM-generated context to search for a well-performing skipped-layer configuration for the current input stream. Confidence-aware inference acceleration terminates drafting early when the draft confidence score falls below a threshold and dynamically selects draft candidates based on that score. Experiments with LLaMA-2, CodeLLaMA, and other backbones on summarization, reasoning, storytelling, and code generation show consistent $1.3\sim1.6\times$ wall-clock speedups with high draft acceptance ($90\%\sim100\%$), while preserving the original distribution of the generated text. Because the approach needs no training or extra parameters, it applies to dynamic data streams and a variety of models, and it behaves favorably under domain shifts and as model size scales. The work offers a practical, generalizable way to accelerate LLM inference in real-world settings while maintaining output fidelity.

Abstract

Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to verify them in parallel. While this technique has achieved notable speedups, most existing approaches require either additional parameters or extensive training to construct effective draft models, which restricts their applicability across different LLMs and tasks. To address this limitation, we explore a novel plug-and-play SD solution based on layer skipping, which uses the target LLM with some intermediate layers skipped as the compact draft model. Our analysis reveals that LLMs exhibit great potential for self-acceleration through layer sparsity, and that this sparsity is task-specific. Building on these insights, we introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. SWIFT does not require auxiliary models or additional training, making it a plug-and-play solution for accelerating LLM inference across diverse input data streams. Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT achieves a 1.3x-1.6x speedup while preserving the original distribution of the generated text. We release our code at https://github.com/hemingkx/SWIFT.
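
The draft-then-verify loop described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SWIFT implementation: `toy_target`, `toy_draft`, and `speculative_decode` are hypothetical names, a "model" is just a function that maps a token list to the next token, and only greedy decoding is covered. In SWIFT the draft model would be the target LLM with some layers skipped, and verification would be one parallel forward pass rather than the Python loop used here.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding (toy sketch)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft phase: the cheap model proposes draft_len tokens one by one.
        drafted: List[int] = []
        for _ in range(draft_len):
            drafted.append(draft(tokens + drafted))
        # Verify phase: a real LLM scores all drafted positions in one
        # parallel forward pass; this toy checks them in a loop instead.
        accepted: List[int] = []
        for token in drafted:
            expected = target(tokens + accepted)
            if token != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(token)
        else:
            # Every draft was accepted, so the target also yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit]


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the full model."""
    return (sum(tokens) * 31 + 7) % 50


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical stand-in for a cheaper model that is sometimes wrong."""
    guess = toy_target(tokens)
    return guess if len(tokens) % 5 else (guess + 1) % 50
```

Because every emitted token is either a draft the target agrees with or the target's own correction, the greedy output of `speculative_decode` matches `greedy_decode` on the target alone. The draft model only affects how many tokens are accepted per verification step, and therefore the speed.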

Paper Structure

This paper contains 50 sections, 6 equations, 12 figures, 14 tables.

Figures (12)

  • Figure 1: Illustration of the prior solution and ours for plug-and-play SD. (a) Jacobi-based drafting appends multiple pseudo tokens to the input prompt, enabling the target LLM to generate multiple tokens as drafts in a single step. (b) SWIFT adopts sparsity-based drafting, which exploits the inherent sparsity in LLMs to facilitate efficient drafting. This work is the first exploration of plug-and-play SD using sparsity-based drafting.
  • Figure 2: (a) LLMs possess self-acceleration potential via layer sparsity. By utilizing drafts from the top-$k$ candidates, we found that uniformly skipping half of the layers during drafting yields a notable $1.2\times$ speedup. (b) Layer sparsity is task-specific. Each task requires its own optimal set of skipped layers, and applying the skipped layer configuration from one task to another can lead to substantial performance degradation. "X LS" represents the skipped layer set optimized for task X.
  • Figure 3: Timeline of SWIFT inference. N denotes the maximum generation length per instance.
  • Figure 4: Layer set optimization process in SWIFT. During the optimization stage, SWIFT performs an optimization step prior to each LLM decoding step to adjust the skipped layer set, which involves: (a) Efficient layer set optimization. SWIFT integrates random search with interval Bayesian optimization to propose layer set candidates; (b) Parallel candidate evaluation. SWIFT uses LLM-generated tokens (i.e., prior context) as ground truth, enabling simultaneous validation of the proposed candidates. The best-performing layer set is selected to accelerate the current decoding step.
  • Figure 5: Confidence-aware inference process of SWIFT. (a) The drafting terminates early if the confidence score drops below threshold $\epsilon$. (b) Draft candidates are dynamically selected based on confidence and then verified in parallel by the target LLM.
  • ...and 7 more figures
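
The ideas in the captions of Figures 4 and 5 can be illustrated with a small toy sketch. Everything here is an assumption made for illustration: the additive `DELTAS` "layers", the function names, and the scoring rule are hypothetical and are not taken from the SWIFT code. The search uses plain random search only (the interval Bayesian optimization mentioned in Figure 4 is omitted), candidates are scored sequentially rather than in parallel, and the early-stop helper discards the first low-confidence token, which is one possible reading of Figure 5(a).

```python
import random
from typing import Callable, FrozenSet, List, Sequence, Tuple

VOCAB = 97
# Hypothetical toy "layers": layer i adds DELTAS[i] to the hidden state.
# Layers with delta 0 are redundant, so skipping them never changes the output.
DELTAS = [3, 0, 0, 5, 0, 0, 0, 11]


def forward(last_token: int, skip: FrozenSet[int] = frozenset()) -> int:
    """Predict the next token, bypassing every layer index in `skip`."""
    hidden = last_token
    for index, delta in enumerate(DELTAS):
        if index not in skip:
            hidden = (hidden + delta) % VOCAB
    return hidden


def match_rate(skip: FrozenSet[int], context: Sequence[int]) -> float:
    """Fraction of already generated tokens that the skipped model reproduces."""
    pairs = list(zip(context, context[1:]))
    hits = sum(forward(prev, skip) == nxt for prev, nxt in pairs)
    return hits / len(pairs)


def random_search(context: Sequence[int], current: FrozenSet[int], num_skipped: int,
                  trials: int, rng: random.Random) -> Tuple[FrozenSet[int], float]:
    """Propose random skipped-layer sets and keep the best one seen so far."""
    best, best_score = current, match_rate(current, context)
    for _ in range(trials):
        candidate = frozenset(rng.sample(range(len(DELTAS)), num_skipped))
        score = match_rate(candidate, context)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def draft_with_early_stop(step: Callable[[List[int]], Tuple[int, float]],
                          prefix: Sequence[int], max_draft_len: int,
                          epsilon: float) -> List[int]:
    """Draft tokens until the confidence returned by `step` drops below epsilon."""
    drafted: List[int] = []
    for _ in range(max_draft_len):
        token, confidence = step(list(prefix) + drafted)
        if confidence < epsilon:
            break  # assumption: the low-confidence token is not kept
        drafted.append(token)
    return drafted


# Prior context: tokens produced by the full model act as ground truth.
context = [1]
for _ in range(20):
    context.append(forward(context[-1]))

start = frozenset({0, 3})  # a deliberately poor initial layer set
best, score = random_search(context, start, num_skipped=4, trials=50,
                            rng=random.Random(0))
```

The design point this sketch tries to convey is that no labeled data is needed: tokens the full model has already generated serve as the reference, so a candidate layer set can be scored by how often the layer-skipped model reproduces them.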