CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

Zhiyuan Ning, Jiawei Shao, Ruge Xu, Xinfei Guo, Jun Zhang, Chi Zhang, Xuelong Li

TL;DR

CAS-Spec tackles the latency challenge of autoregressive LLMs by constructing a hierarchy of on-the-fly draft models using dynamically switchable inference acceleration (DSIA) and coordinating them with Dynamic Tree Cascade (DyTC) for adaptive routing and draft-length decisions. It eliminates the need to train multiple draft models and achieves state-of-the-art acceleration among on-the-fly speculative decoding methods, with average speedups from $1.1\times$ to $2.3\times$ over autoregressive decoding and improvements of $47\%$ and $48\%$ over cascade-based and tree-based baselines, respectively. The approach uses DSIA strategies such as layer sparsity, early exiting, activation sparsity, and quantization to embed draft variants within the target model, while DyTC uses online acceptance-rate estimates and hardware-aware latency predictions to dynamically expand and prune the draft tree. The results demonstrate practical, training-free acceleration across multiple LLMs and datasets, making CAS-Spec a scalable option for latency-sensitive deployments, with potential for further gains from additional DSIA techniques and hardware-aware optimizations.

Abstract

Speculative decoding has become widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding methods. We introduce a Dynamic Tree Cascade (DyTC) algorithm that adaptively routes the multi-level draft models and assigns the draft lengths, based on heuristics of acceptance rates and latency prediction. Our CAS-Spec method achieves state-of-the-art acceleration compared to existing on-the-fly speculative decoding methods, with an average speedup from $1.1\times$ to $2.3\times$ over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by $47\%$ and $48\%$ over cascade-based baseline and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds promising potential for further acceleration as self-speculative decoding techniques continue to evolve.
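
To make the trade-off between acceptance rate and draft latency concrete, the following is a minimal sketch of how a draft length could be chosen from an online acceptance-rate estimate. It is not the DyTC algorithm from the paper. It uses the standard single-draft-model analysis of speculative decoding, in which a draft of length $k$ with per-token acceptance rate $\alpha$ and relative draft cost $c$ yields an expected speedup of $\frac{1-\alpha^{k+1}}{(1-\alpha)(kc+1)}$, and it assumes acceptances are independent across positions. The names `expected_speedup`, `AcceptanceTracker`, and `best_draft_length` are hypothetical, as are the default decay and the `max_k` cap.

```python
from dataclasses import dataclass


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup over autoregressive decoding (standard analysis).

    alpha: per-token acceptance rate, assumed i.i.d., 0 <= alpha < 1
    k:     number of tokens drafted per verification step
    c:     latency of one draft step divided by one target step
    """
    if k == 0:
        return 1.0  # no drafting is plain autoregressive decoding
    # Expected tokens produced per verification step, including the
    # token the target model always contributes.
    expected_tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Cost of that step in units of one target forward pass.
    return expected_tokens / (k * c + 1.0)


@dataclass
class AcceptanceTracker:
    """Exponential moving average of observed acceptance (assumed estimator)."""

    alpha: float = 0.5   # assumed prior before any observation
    decay: float = 0.9   # assumed smoothing factor

    def update(self, accepted: int, proposed: int) -> None:
        # Crude estimate: fraction of proposed draft tokens accepted.
        if proposed > 0:
            observed = accepted / proposed
            self.alpha = self.decay * self.alpha + (1.0 - self.decay) * observed


def best_draft_length(alpha: float, c: float, max_k: int = 16) -> int:
    """Draft length in [0, max_k] that maximizes the expected speedup."""
    alpha = min(alpha, 0.999)  # keep the closed form well-defined
    return max(range(max_k + 1), key=lambda k: expected_speedup(alpha, k, c))
```

Under this model, a slow draft with a low acceptance rate leads `best_draft_length` to return 0, which falls back to plain autoregressive decoding, while a cheap, accurate draft favors longer drafts. A cascade with several draft levels would need such an estimate per level, which is the kind of decision the abstract attributes to DyTC.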

Paper Structure

This paper contains 24 sections, 2 theorems, 8 equations, 3 figures, 2 tables, 2 algorithms.

Key Result

Proposition 4.3

Tree Cascade (TC) assigns the different draft models in the draft token tree to maximize the expected acceptance rate of the early draft tokens.

Figures (3)

  • Figure 1: (a) Comparison of on-the-fly SSD methods (Lookahead, SWIFT) and methods with statistical draft models (e.g., PLD) on Spec-Bench, tested on an NVIDIA H100 GPU. (b) Theoretical effective bound of the vertical cascade for a draft model $\mathcal{M}_{d_1}$ to be beneficial in cascade speculative decoding compared with vanilla speculative decoding using $\mathcal{M}_{d_2}$ alone. The x-axis is the expected acceptance rate $\alpha(\mathcal{M}_{t}, \mathcal{M}_{d_1})$ and the y-axis is the cost coefficient $c(\mathcal{M}_{t}, \mathcal{M}_{d_1})$. The SWIFT data points are from the Spec-Bench results for the Vicuna-7B-v1.3 model. (The acceptance rates of PLD are between 0.1 and 0.5 in this setting.) (c) Theoretical effective bound of the horizontal cascade, similar to (b). In this case, we consider $\alpha_{t,d2}$, which is commonly similar to $\alpha_{t,d2}$ in practice.
  • Figure 2: Illustration of an example of the Dynamic Tree Cascade (DyTC) algorithm when $n=3$.
  • Figure 3: Speedup of different methods relative to baseline. AR (1.0) and PLD (1.54) reference lines are shown. The vertical line separates two groups of methods.

Theorems & Definitions (4)

  • Definition 4.1
  • Definition 4.2
  • Proposition 4.3
  • Proposition 4.4