Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

Rahul Thomas, Teo Kitanovski, Micah Goldblum, Arka Pal

TL;DR

A systematic evaluation of verification strategies across model families, tasks, and sampling regimes finds that Traversal Verification consistently dominates, with OT-based methods lagging far behind. The paper then proposes delayed tree expansion, which first drafts a partial single path, delaying the i.i.d. branching point.

Abstract

Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens and then applying a verification algorithm that accepts a subset of these tokens. While prior work has proposed various verification algorithms for i.i.d. rollouts, their relative performance under matched settings remains unclear. In this work, we first present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes, and find that Traversal Verification dominates consistently, with OT-based methods lagging far behind. Our analysis shows that this occurs because OT-based methods achieve high multi-token acceptance near the root of the draft tree, while multi-token gains are most impactful deeper in the draft tree, where draft and target distributions diverge. Based on this insight, we propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point. We show that delayed tree expansion preserves the target distribution and improves on root-node i.i.d. rollouts. Further, we develop a dynamic neural selector that estimates the expected block efficiency of optimal-transport-based verification methods from draft and target features, enabling context-dependent expansion decisions. Our neural selector allows OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across a wide range of models, datasets, and sampling settings.
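
To make the shape of the draft tree concrete, the following is a minimal sketch of the drafting step only, under stated assumptions. The function and parameter names (`draft_delayed_tree`, `delay`, `num_paths`, `rollout_len`) and the toy draft model are hypothetical and do not come from the paper. Verification against the target model is deliberately left out.

```python
import random
from typing import Callable, List, Sequence, Tuple


def toy_draft_distribution(context: Sequence[int], vocab_size: int = 4) -> List[float]:
    """Hypothetical stand-in for a draft model (NOT a real LLM).

    Returns a deterministic pseudo-random distribution over a tiny
    vocabulary, keyed on the token context.
    """
    rng = random.Random(hash(tuple(context)) % (2**32))
    weights = [rng.random() + 1e-3 for _ in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one token index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def draft_delayed_tree(
    prefix: Sequence[int],
    delay: int,
    num_paths: int,
    rollout_len: int,
    rng: random.Random,
    draft: Callable[[Sequence[int]], List[float]] = toy_draft_distribution,
) -> Tuple[List[int], List[List[int]]]:
    """Draft a single path of `delay` tokens, then branch into i.i.d. rollouts.

    Assumption: `delay=0` corresponds to branching at the root, i.e. plain
    root-node i.i.d. rollouts.
    """
    context = list(prefix)

    # Partial single path: one token per step, no branching yet.
    trunk: List[int] = []
    for _ in range(delay):
        token = sample_token(draft(context), rng)
        trunk.append(token)
        context.append(token)

    # Delayed branching point: every rollout starts from the same trunk
    # context and is sampled independently from the draft model.
    branches: List[List[int]] = []
    for _ in range(num_paths):
        branch_context = list(context)
        branch: List[int] = []
        for _ in range(rollout_len):
            token = sample_token(draft(branch_context), rng)
            branch.append(token)
            branch_context.append(token)
        branches.append(branch)

    return trunk, branches


if __name__ == "__main__":
    trunk, branches = draft_delayed_tree(
        prefix=[1, 2], delay=3, num_paths=4, rollout_len=2, rng=random.Random(0)
    )
    print("trunk:", trunk)
    for i, branch in enumerate(branches):
        print(f"branch {i}:", branch)
```

In this sketch the number of drafted tokens is `delay + num_paths * rollout_len`, so a longer trunk moves the same branching budget deeper into the tree. How the delay is chosen per context is the role the abstract assigns to the neural selector, which is not modeled here.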

Paper Structure (40 sections, 9 equations, 1 figure, 15 tables, 15 algorithms)

This paper contains 40 sections, 9 equations, 1 figure, 15 tables, 15 algorithms.

Figures (1)

  • Figure 1: We generate 200,000+ draft (Llama-3 8B-Instruct) trees from roots of target model (Llama-3 70B-Instruct) trajectories and compute both L1 target-draft distance and average OTLP acceptances across varying draft tree depths. The divergence between target and draft distributions spikes deeper in the tree, and acceptances across all OTLP methods consistently decrease with depth.
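
As a rough illustration of the quantity in the Figure 1 caption, the sketch below computes the L1 distance between a target and a draft next-token distribution. It also shows the standard single-draft speculative sampling acceptance probability, which equals one minus half the L1 distance. This is an assumption-laden sketch: the function names and example distributions are hypothetical, the paper's exact averaging procedure is not given in this excerpt, and the acceptance shown is the single-path baseline, not the multi-path OTLP acceptance the figure reports.

```python
from typing import Sequence


def l1_distance(target: Sequence[float], draft: Sequence[float]) -> float:
    """L1 distance between two distributions over the same vocabulary."""
    if len(target) != len(draft):
        raise ValueError("distributions must share a vocabulary")
    return sum(abs(p - q) for p, q in zip(target, draft))


def single_draft_acceptance(target: Sequence[float], draft: Sequence[float]) -> float:
    """Acceptance probability of one drafted token under standard
    speculative sampling: sum_x min(p(x), q(x)) = 1 - L1(p, q) / 2.
    """
    if len(target) != len(draft):
        raise ValueError("distributions must share a vocabulary")
    return sum(min(p, q) for p, q in zip(target, draft))


if __name__ == "__main__":
    # Hypothetical next-token distributions over a 3-token vocabulary.
    target = [0.5, 0.3, 0.2]
    draft = [0.4, 0.4, 0.2]
    print("L1 distance:", l1_distance(target, draft))
    print("single-draft acceptance:", single_draft_acceptance(target, draft))
```

The identity makes the caption's observation intuitive for the single-path case: as the L1 distance grows with depth, the probability of accepting a drafted token shrinks.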

Theorems & Definitions (5)

  • Definition 3.1
  • Definition 3.2
  • Definition 5.1
  • Definition 5.2
  • Definition 5.3