Tutorial Proposal: Speculative Decoding for Efficient LLM Inference

Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li

TL;DR

This tutorial proposal addresses the latency of LLM inference, which stems from token-by-token autoregressive generation and the memory bottleneck of moving model parameters between high-bandwidth memory and on-chip caches at every decoding step. It surveys Speculative Decoding (SD), a decoding paradigm in which a draft model proposes multiple tokens per step and the target LLM verifies them in parallel, so that the output distribution is identical to that of the target LLM. Key contributions include a taxonomy of SD methods (independent drafting vs. self-drafting; greedy, speculative sampling, and token-tree verification), a synthesis of cutting-edge algorithms (e.g., EAGLE and EAGLE-2) and related approaches (Medusa, GliDe with CaPE, Lookahead), and an evaluation framework with downstream applications such as retrieval-augmented, long-context, and multimodal SD. The work discusses practical trade-offs between drafting efficiency and verification, outlines benchmarks for fair comparisons, and highlights future directions such as integration with batched inference and combining SD with other acceleration techniques. It underscores the potential to significantly accelerate inference while preserving the target model's output distribution, $P(y_t \mid y_{<t}, x)$.

Abstract

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative decoding paradigm to mitigate the high inference latency stemming from autoregressive decoding in LLMs. At each decoding step, SD efficiently drafts several future tokens and then verifies them in parallel. Unlike traditional autoregressive decoding, this approach decodes multiple tokens per step, achieving promising 2x to 4x speedups in LLM inference while preserving the original output distribution. The tutorial delves into the latest techniques in SD, including draft model architectures and verification strategies. It also explores the acceleration potential and future research directions in this promising field. We aim for this tutorial to elucidate the current research landscape and offer insights for researchers interested in Speculative Decoding, ultimately contributing to more efficient LLM inference.
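The draft-then-verify step described in the abstract can be illustrated with a minimal sketch, assuming the standard speculative sampling acceptance rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized residual max(0, p - q). The functions `target_probs` and `draft_probs` are hypothetical toy stand-ins for the target and draft models, and the four-token vocabulary is an assumption made for illustration; this is not the tutorial's implementation. A real system would score all drafted positions in a single parallel forward pass of the target model, whereas this sketch loops over them for clarity.

```python
import random

VOCAB_SIZE = 4  # assumed toy vocabulary: token ids 0..3


def target_probs(context):
    """Hypothetical stand-in for the target LLM's next-token distribution."""
    last = context[-1] if context else 0
    probs = [0.1] * VOCAB_SIZE
    probs[(last + 1) % VOCAB_SIZE] = 0.7
    return probs


def draft_probs(context):
    """Hypothetical stand-in for a cheaper, less accurate draft model."""
    last = context[-1] if context else 0
    probs = [0.2] * VOCAB_SIZE
    probs[(last + 1) % VOCAB_SIZE] = 0.4
    return probs


def sample(weights, rng):
    """Draw one token id in proportion to the (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(context, num_draft, rng):
    """Run one draft-then-verify step and return the newly decoded tokens."""
    # 1. Draft: the draft model proposes num_draft tokens autoregressively.
    drafted, draft_dists = [], []
    ctx = list(context)
    for _ in range(num_draft):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. Verify: check each drafted token against the target distribution.
    new_tokens = []
    ctx = list(context)
    for tok, q in zip(drafted, draft_dists):
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            new_tokens.append(tok)  # accepted
            ctx.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        new_tokens.append(sample(residual, rng))
        return new_tokens

    # All drafts accepted: take one extra token from the target model.
    new_tokens.append(sample(target_probs(ctx), rng))
    return new_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [0]
    while len(tokens) < 12:
        tokens.extend(speculative_step(tokens, num_draft=3, rng=rng))
    print(tokens)
```

Each call yields between one and `num_draft + 1` tokens, which is where the speedup over one-token-per-step decoding comes from. Under the stated acceptance rule, each emitted token follows the target distribution: with the toy numbers above, the preferred next token is accepted directly with probability 0.4 and recovered through residual resampling with probability 0.3, matching the target's 0.7.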

Paper Structure

This paper contains 30 sections.