Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Tao Jin, Phuong Minh Nguyen, Naoya Inoue

Abstract

Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
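To make the drafting step concrete, the following is a minimal sketch (not the authors' implementation) of how an anisotropic spine tree could be assembled from the two token sources described above. The names `anchor`, `context_continuation`, `adjacency_alternatives`, and `spine_ratio` are illustrative stand-ins and assumed hyperparameters, not Goose's actual API.

```python
# Illustrative sketch only: assemble an anisotropic speculation tree from two
# training-free sources under a fixed verification budget. The high-acceptance
# source (tokens copied from an n-gram match in the input context) forms a deep
# spine; the low-acceptance source (an adjacency table built from prior forward
# passes) supplies wide sibling branches at each spine position.

def build_spine_tree(anchor, context_continuation, adjacency_alternatives,
                     budget=64, spine_ratio=0.25):
    """Return a flat tree as (token, parent_index) pairs; parent -1 is the root.

    context_continuation: tokens copied from the context n-gram match.
    adjacency_alternatives: callable mapping a token to candidate next tokens.
    spine_ratio: assumed fraction of the budget reserved for the spine
        (the paper adapts this per decoding cycle).
    """
    spine_len = max(1, int(budget * spine_ratio))
    spine = context_continuation[:spine_len]
    per_node = max(1, (budget - len(spine)) // max(1, len(spine)))  # width w_i

    tree, prev_tok, parent = [], anchor, -1
    for tok in spine:
        node_idx = len(tree)
        tree.append((tok, parent))  # spine token extends the deep chain
        # Sibling fallbacks for the same position, used only if `tok` is rejected.
        alts = [a for a in adjacency_alternatives(prev_tok) if a != tok]
        for alt in alts[:per_node]:
            if len(tree) < budget:
                tree.append((alt, parent))
        prev_tok, parent = tok, node_idx
    return tree
```

For example, with `budget=64` and `spine_ratio=0.25`, the sketch produces a 16-token spine with roughly three fallback siblings per spine position, i.e., deep along the reliable source and wide along the unreliable one.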

Paper Structure

This paper contains 51 sections, 4 theorems, 9 equations, 6 figures, 6 tables, 3 algorithms.

Key Result

Proposition 1

Under the heterogeneous model with spine length $m \geq 1$ and branch widths $\{w_i\}_{i=0}^{m-1}$ (total budget $m + \sum_i w_i \leq B$), the expected accepted path length satisfies a lower bound expressed in terms of $\phi_i$ and $\bar{\ell}$ as defined in the paper. The bound is tight when the branches are independent chains.
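As a rough numerical illustration of the setting this proposition formalizes, here is a small Monte Carlo sketch. It assumes spine tokens are accepted i.i.d. with probability $p_s$ and branch tokens with probability $p_t$, treats branches as single-token leaves (so it ignores the $\bar{\ell}$ continuation term), and uses invented rates; it is not the paper's bound, only a toy instance of the heterogeneous model.

```python
# Monte Carlo sketch of a heterogeneous spine-tree model (illustrative
# assumptions only: i.i.d. acceptance, single-token branches, invented rates).
import random

def expected_accepted_length(m, widths, p_s, p_t, trials=200_000):
    """Mean accepted path length for a spine of length m whose node i
    carries widths[i] one-token fallback branches."""
    total = 0
    for _ in range(trials):
        length = 0
        for i in range(m):
            if random.random() < p_s:                      # spine token accepted
                length += 1
            else:                                          # spine rejected here
                if any(random.random() < p_t for _ in range(widths[i])):
                    length += 1                            # a fallback rescues the step
                break                                      # fallbacks are leaves here
        total += length
    return total / trials

if __name__ == "__main__":
    B, p_s, p_t = 64, 0.8, 0.15            # budget and invented acceptance rates
    deep = expected_accepted_length(16, [3] * 16, p_s, p_t)   # 16 + 16*3 = 64
    wide = expected_accepted_length(4, [15] * 4, p_s, p_t)    # 4 + 4*15 = 64
    print(f"deep spine (m=16): {deep:.2f} accepted tokens/step")
    print(f"wide tree  (m=4) : {wide:.2f} accepted tokens/step")
```

Under these made-up rates, spending the budget on depth along the reliable source yields a longer expected accepted path than the wider, shallower allocation, which is the intuition the proposition makes precise.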

Figures (6)

  • Figure 1: Speculation-tree topologies ($B$ fixed). (a) PLD spine (linear). (b) EAGLE-2 pruned tree. (c) Isotropic (uniform-rate) tree. (d) Goose anisotropic spine tree (spine ratio adapts per cycle; \ref{sec:consensus}).
  • Figure 2: Goose pipeline overview. From the anchor, Stage 1 draws from a unified candidate pool: context matching produces the spine (blue), the adjacency table supplies branches (orange). Stage 2 verifies all candidates via one LLM forward pass. Stage 3 selects the longest accepted path via a greedy walk, discovering a spine continuation: "2"$\,\to\,$")" extends beyond the spine mismatch, yielding 8 tokens per call. (A minimal sketch of Stages 2-3 follows the figure list.)
  • Figure 3: Wall-clock speedup over autoregressive (AR) decoding across five models and five benchmarks. All methods are lossless (greedy decoding, identical output). Speedup values for Goose are annotated above each bar. Full per-benchmark results, including compression ratio ($\tau$), are reported in \ref{tab:main-full}. $^\dagger$Vicuna-33B runs on 2$\times$A100-40GB; all others on a single A40.
  • Figure 4: (a) Goose's mean $\tau$ decomposed into the best standalone baseline $\max(\tau_{\textsc{PLD}{}},\tau_{\textsc{TR}{}})$ (gray) and the synergy gain from the spine-tree topology (green); the gain ranges from $+16\%$ (Vicuna-13B) to $+45\%$ (Llama-3-8B). (b) Hyperparameter sensitivity (Qwen3-8B): each panel sweeps one parameter; $\tau$ plateaus at $B{\geq}60$ and $D{\geq}6$, and varies by less than 0.6 $\tau$ units across $r$ and $\rho$.
  • Figure 5: Acceptance heterogeneity across five benchmarks. (a) Mean $\hat{p}_s$ and $\hat{p}_t$ per benchmark (averaged over 5 models); the ratio $N\times$ above each bar pair quantifies the gap. (b) Heterogeneity ratio $\hat{p}_s/\hat{p}_t$ per model$\times$benchmark (25 settings).
  • ...and 1 more figure
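Below is the minimal sketch referenced in the Figure 2 caption: the verify-and-select step (Stages 2-3). It assumes the drafted tree is stored as (token, parent_index) pairs and that one batched forward pass with tree attention has already produced, for each node, the target model's greedy token for that node's position (`greedy_next`); the names here are illustrative, not Goose's API.

```python
# Illustrative sketch of Stages 2-3 from Figure 2: after one verification
# forward pass, walk the tree greedily and keep the longest drafted path whose
# every token matches the target model's own greedy choice (lossless output).

def longest_accepted_path(tree, greedy_next):
    """tree: list of (token, parent_index) pairs; parent -1 is the root.
    greedy_next[i]: the target model's greedy token for node i's position,
    conditioned on the path from the root to node i's parent (assumed to come
    from a single tree-attention forward pass over all candidates)."""
    children = {}
    for idx, (_, parent) in enumerate(tree):
        children.setdefault(parent, []).append(idx)

    accepted, parent = [], -1
    while True:
        # Among the siblings at this position, at most one can equal the
        # model's single greedy token (siblings are assumed distinct).
        match = next((i for i in children.get(parent, [])
                      if tree[i][0] == greedy_next[i]), None)
        if match is None:
            # Mismatch: stop; in standard lossless speculative decoding the
            # target model's own greedy token is then emitted as a bonus token.
            break
        accepted.append(tree[match][0])
        parent = match
    return accepted
```

Because sibling candidates are distinct tokens, the walk never backtracks, so selection costs time linear in the accepted depth.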

Theorems & Definitions (4)

  • Proposition 1: Spine Tree Expected Yield
  • Proposition 2: Optimal Branch Allocation
  • Proposition 3: Spine Tree Dominance
  • Proposition 4: Non-Degradation Guarantee