Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

Rahul Thomas; Teo Kitanovski; Micah Goldblum; Arka Pal

TL;DR

A systematic evaluation of verification strategies across model families, tasks, and sampling regimes finds that Traversal Verification dominates consistently, with optimal-transport (OT)-based methods lagging far behind. The paper proposes delayed tree expansion, which drafts a partial single path first, delaying the i.i.d. rollouts.

Abstract

Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applying a verification algorithm that accepts a subset of these tokens. While prior work has proposed various verification algorithms for i.i.d. rollouts, their relative performance under matched settings remains unclear. In this work, we first present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes, and find that Traversal Verification dominates consistently, with optimal-transport (OT)-based methods lagging far behind. Our analysis shows that this occurs because OT-based methods achieve high multi-token acceptance near the root of the draft tree, while multi-token gains are most impactful deeper in the draft tree, where draft and target distributions diverge. Based on this insight, we propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point. We show that delayed tree expansion preserves the target distribution and improves on root-node i.i.d. rollouts. Further, we develop a dynamic neural selector that estimates the expected block efficiency of OT-based verification methods from draft and target features, enabling context-dependent expansion decisions. Our neural selector allows OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across a wide range of models, datasets, and sampling settings.
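The drafting side of delayed tree expansion can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `draft_delayed_tree`, its parameters, and the toy uniform draft model `toy_sample_next` are all hypothetical, and the verification step that makes the overall procedure lossless is omitted. The sketch only shows the tree shape described above: a single shared path of `delay` tokens, followed by independent rollouts from the end of that path.

```python
import random
from typing import Callable, List, Sequence


def draft_delayed_tree(
    sample_next: Callable[[Sequence[int], random.Random], int],
    prefix: Sequence[int],
    delay: int,
    num_paths: int,
    depth: int,
    rng: random.Random,
) -> List[List[int]]:
    """Draft `num_paths` continuations of length `depth` that share
    their first `delay` tokens (the single-path trunk)."""
    if not 0 <= delay <= depth:
        raise ValueError("delay must lie in [0, depth]")

    # Step 1: draft the shared single path before any branching.
    trunk: List[int] = []
    for _ in range(delay):
        trunk.append(sample_next(list(prefix) + trunk, rng))

    # Step 2: branch at the end of the trunk; each path continues
    # with its own independent samples from the draft model.
    paths: List[List[int]] = []
    for _ in range(num_paths):
        path = list(trunk)
        while len(path) < depth:
            path.append(sample_next(list(prefix) + path, rng))
        paths.append(path)
    return paths


def toy_sample_next(context: Sequence[int], rng: random.Random) -> int:
    # Hypothetical stand-in for a draft model: uniform over 5 tokens.
    return rng.randrange(5)


paths = draft_delayed_tree(
    toy_sample_next, prefix=[0], delay=2, num_paths=3, depth=4,
    rng=random.Random(0),
)
root_paths = draft_delayed_tree(
    toy_sample_next, prefix=[0], delay=0, num_paths=3, depth=4,
    rng=random.Random(0),
)
```

Setting `delay=0` in this sketch reduces to root-node i.i.d. rollouts, the baseline that the abstract compares against.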

Paper Structure (40 sections, 9 equations, 1 figure, 15 tables, 15 algorithms)


Introduction
Background
Performance Comparisons
Improving Drafting
Context-dependent tree structures.
Offline tree optimization for block efficiency.
Dynamic draft length control.
Training-based tree policy.
Hardware-Aware Tree Decoding.
Verification Algorithms
Single-Path Algorithms
Naive speculative sampling.
Tree verification.
Block verification (BV).
Multi-Path Algorithms
...and 25 more sections

Figures (1)

Figure 1: We generate 200,000+ draft (Llama-3 8B-Instruct) trees from roots of target model (Llama-3 70B-Instruct) trajectories and compute both L1 target-draft distance and average OTLP acceptances across varying draft tree depths. The divergence between target and draft distributions spikes deeper in the tree, and acceptances across all OTLP methods consistently decrease with depth.
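For reference, the L1 target-draft distance in the caption is the sum of absolute differences between two next-token distributions. The sketch below assumes both distributions are given as plain probability lists over the same vocabulary; the helper name `l1_distance` is hypothetical, and how the authors aggregate the distance across trees and depths is not specified here.

```python
from typing import Sequence


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """L1 distance between two distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(abs(a - b) for a, b in zip(p, q))


identical = l1_distance([0.5, 0.5], [0.5, 0.5])  # no divergence
disjoint = l1_distance([1.0, 0.0], [0.0, 1.0])   # maximal divergence
```

The value ranges from 0 for identical distributions to 2 for distributions with disjoint support.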

Theorems & Definitions (5)

Definition 3.1
Definition 3.2
Definition 5.1
Definition 5.2
Definition 5.3
