SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models

Fahao Chen, Peng Li, Tom H. Luan, Zhou Su, Jing Deng

TL;DR

SPIN accelerates LLM inference serving by combining three techniques: multiple heterogeneous speculative models chosen by a learning-based SSM selector, fast batch verification via request decomposition, and a speculative decoding pipeline that partitions work into micro-batches so that speculation and verification overlap on GPUs. It achieves throughput gains of around $2.28\times$ over strong baselines across multiple LLMs and datasets, with the benefits coming from improved SSM matching, reduced padding, and better resource utilization. The work targets high-throughput LLM serving and provides a blueprint for deploying speculative decoding at scale on heterogeneous GPU clusters.
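To make the "reduced padding" point concrete, the sketch below counts how many padding tokens a batch needs when every request is padded to the longest one, and how that count changes when long requests are split into shorter pieces. The splitting rule (fixed-size chunks) and the function names are assumptions made for illustration, not the decomposition method used by SPIN. The sketch also only counts tokens: a real system would still have to combine the attention results of pieces that belong to the same request.

```python
def padding_overhead(lengths):
    """Padding tokens needed when every request is padded to the longest one."""
    if not lengths:
        return 0
    return max(lengths) * len(lengths) - sum(lengths)


def decompose(lengths, chunk):
    """Split each request into pieces of at most `chunk` tokens.

    Hypothetical fixed-size chunking, used only to illustrate the effect
    of decomposition on padding.
    """
    pieces = []
    for n in lengths:
        full, rest = divmod(n, chunk)
        pieces.extend([chunk] * full)  # full-size pieces
        if rest:
            pieces.append(rest)        # trailing shorter piece
    return pieces


if __name__ == "__main__":
    batch = [3, 5, 12]                 # tokens to verify per request
    pieces = decompose(batch, chunk=4)
    print("padding before:", padding_overhead(batch))
    print("pieces:", pieces)
    print("padding after:", padding_overhead(pieces))
```

In this toy batch the single 12-token request forces the two short requests to be padded heavily; after chunking, all pieces have similar length and far fewer padding tokens are processed.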

Abstract

Speculative decoding has been shown to be an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently verified by the LLM in a verification phase. However, current state-of-the-art speculative decoding approaches have three key limitations: they handle requests of varying difficulty with homogeneous SSMs, they lack robust support for batch processing, and they do not holistically optimize the speculation and verification phases together. In this paper, we introduce SPIN, an efficient LLM inference serving system based on speculative decoding, designed to address these challenges through three main innovations. First, SPIN improves token speculation by using multiple heterogeneous SSMs, with a learning-based algorithm for SSM selection that operates without prior knowledge of request difficulty. Second, SPIN employs a request decomposition method to minimize batching overhead during LLM verification. Finally, SPIN orchestrates the speculation and verification phases by pipelining their execution on GPUs to achieve further acceleration. Experimental results demonstrate that SPIN significantly outperforms state-of-the-art methods, achieving a performance increase of approximately $2.28\times$.
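Selecting among several SSMs without knowing request difficulty in advance can be framed as a multi-armed bandit problem. The sketch below uses the classic UCB1 rule as a stand-in; it is not the selection algorithm from the paper. The class name, the simulated acceptance rates, and the 0/1 reward (whether a speculated token was accepted) are all assumptions chosen to keep the example small.

```python
import math
import random


class UCB1Selector:
    """Picks one of several candidate SSMs with the UCB1 bandit rule (illustrative)."""

    def __init__(self, num_ssms):
        self.counts = [0] * num_ssms   # how often each SSM was chosen
        self.means = [0.0] * num_ssms  # running mean reward per SSM

    def select(self):
        # Try every SSM once before using the confidence bound.
        for ssm, count in enumerate(self.counts):
            if count == 0:
                return ssm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, ssm, reward):
        # Incremental update of the running mean reward.
        self.counts[ssm] += 1
        self.means[ssm] += (reward - self.means[ssm]) / self.counts[ssm]


def simulate(acceptance_rates, rounds=2000, seed=0):
    """Simulate selection against hypothetical per-SSM acceptance rates."""
    rng = random.Random(seed)
    selector = UCB1Selector(len(acceptance_rates))
    for _ in range(rounds):
        ssm = selector.select()
        reward = 1.0 if rng.random() < acceptance_rates[ssm] else 0.0
        selector.update(ssm, reward)
    return selector.counts


if __name__ == "__main__":
    # Made-up acceptance rates for three SSMs of different sizes.
    print(simulate([0.2, 0.5, 0.8]))
```

A production reward would likely also account for the speculation cost of each SSM, since a larger SSM may be accepted more often but take longer per token; the 0/1 reward here ignores that trade-off.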

Paper Structure

This paper contains 27 sections, 1 theorem, 7 equations, 13 figures, 2 algorithms.

Key Result

Theorem 1

The total regret $\mathcal{R}(T)$ is bounded by $\mathcal{O}(\log_{2}T)$.
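For context, the paper's own definition of $\mathcal{R}(T)$ is not reproduced on this page. In the standard stochastic bandit setting, assumed here only for illustration, with $K$ candidate SSMs, expected reward $\mu_k$ for SSM $k$, best expected reward $\mu^{*} = \max_{k} \mu_k$, and $a_t$ the SSM chosen in round $t$, regret is usually written as:

$$
\mathcal{R}(T) = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]
$$

Under a definition of this kind, a logarithmic bound means that the average per-round loss $\mathcal{R}(T)/T$ vanishes as $T$ grows, so the selector converges toward the best choice. The base of the logarithm does not affect the asymptotic statement, since $\log_{2} T = \ln T / \ln 2$.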

Figures (13)

  • Figure 1: Illustration of different speculative decoding approaches.
  • Figure 2: The ratio of different SSMs selected as the best model across requests in three datasets.
  • Figure 3: Results of three random requests using different SSMs.
  • Figure 4: The benefits of speculative decoding with different batch sizes.
  • Figure 5: Illustration of padding tokens in attention computation.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof