Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou

TL;DR

Group Tree Optimization (GTO) aligns draft-model training with the decoding-time tree policy through two components, Draft Tree Reward and Group-based Draft Policy Training, and comes with a proof that increasing the Draft Tree Reward improves acceptance length and speedup.

Abstract

Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K) benchmarks and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B, Qwen3-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over the prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference. Code and draft models are available at https://github.com/hsj576/GTO.
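The two components named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every function name and number in it is hypothetical. It assumes a simplified verification model in which the target model samples a continuation and a tree node is accepted exactly when the path to it matches that sample; under this assumption the expected acceptance length is the sum, over all tree nodes, of the product of target probabilities along the path to the node. The advantage and surrogate functions are the generic group-standardization and PPO clipping formulas; the debiasing against the frozen reference draft model is not modelled here.

```python
from math import prod
from statistics import mean, pstdev


def expected_acceptance_length(paths):
    """Sum over tree nodes of the target probability of reaching that node.

    `paths` holds one list of target-model token probabilities per node
    (root excluded), ordered from the root's child down to the node itself.
    Assumes the simplified "sample matches path" acceptance model.
    """
    return sum(prod(p) for p in paths)


def group_standardized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip=0.2):
    """Generic PPO clipped objective for a single token."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical three-node tree: "is" (0.7), "is the" (0.7 * 0.5), "has" (0.2).
tree_paths = [[0.7], [0.7, 0.5], [0.2]]
reward = expected_acceptance_length(tree_paths)

# Hypothetical rewards of three draft trees sampled for the same prefix.
advantages = group_standardized_advantages([1.0, 2.0, 3.0])
```

Because the reward is a closed-form sum over nodes, no sampling from the target model is needed to evaluate it, which is consistent with the abstract's description of the objective as sampling-free.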

Paper Structure

This paper contains 34 sections, 2 theorems, 28 equations, 2 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Consider a draft tree $\mathbf{T}_t$ and target model temperature $T \geq 0$. Let $L^{\mathrm{dec}}_T(\mathbf{T}_t)$ denote the expected acceptance length at decoding. Then:

Figures (2)

  • Figure 1: Draft policy misalignment between training and decoding. (a) The tree is built by the draft model at decoding: the number on each edge is the token probability predicted by the draft model, e.g., "is" (0.6), and the number in parentheses is the current path confidence, e.g., "It is" (0.6=$1.0\times 0.6$). Training enforces a training-time greedy draft policy, following the locally best child and yielding the path "It $\rightarrow$ is $\rightarrow$ a" (confidence 0.36). At decoding, top-$4$ re-ranking compares sibling paths, where "It $\rightarrow$ has $\rightarrow$ to" (0.38) outperforms the greedy branch, which is thus pruned (red). Training signal concentrated on a single greedy path is wasted when sibling branches win. (b) The target model verifies the tree with its own probabilities. It compares the confidence of each sequence and accepts the sequence "It $\rightarrow$ is $\rightarrow$ the". Even when the greedy branch survives, the target model may accept a different sibling.
  • Figure 2: Experimental results on draft policy misalignment between training and decoding. (a) Fraction of training-time greedy paths that are pruned during draft tree construction (orange bars) and fraction where the accepted path coincides with the greedy path (yellow bars). (b) Accepted greedy paths are also shorter: their average acceptance length is $\mathbf{3\!-\!4}$ tokens, compared to $\mathbf{5\!-\!6}$ for the entire draft tree. (c) Speedup ratio comparison of GTO and EAGLE-3.
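The path-confidence arithmetic in the Figure 1 caption can be reproduced with a small sketch. Only the probability 0.6 for "is" and the path confidences 0.36 and 0.38 come from the caption; the remaining probabilities (0.4 for "has", 0.95 for "to", 0.3 for "the") are made-up values chosen so that the products match. The tree layout and the function names are hypothetical, and the re-ranking is simplified to ranking complete root-to-leaf paths with a small k rather than the top-4 selection over the larger tree in the figure.

```python
from math import prod

# Hypothetical draft tree below the root "It": token -> (draft prob, children).
TREE = {
    "is": (0.6, {"a": (0.6, {}), "the": (0.3, {})}),
    "has": (0.4, {"to": (0.95, {})}),
}


def greedy_path(tree):
    """Follow the locally most probable child at every level."""
    path, probs = [], []
    while tree:
        token, (p, children) = max(tree.items(), key=lambda kv: kv[1][0])
        path.append(token)
        probs.append(p)
        tree = children
    return path, prod(probs)


def leaf_paths(tree, prefix=(), conf=1.0):
    """Yield (path, confidence) for every root-to-leaf path.

    The confidence is the product of draft probabilities along the path.
    """
    for token, (p, children) in tree.items():
        path, c = prefix + (token,), conf * p
        if children:
            yield from leaf_paths(children, path, c)
        else:
            yield list(path), c


def top_k_paths(tree, k):
    """Keep the k complete paths with the highest confidence."""
    ranked = sorted(leaf_paths(tree), key=lambda pc: pc[1], reverse=True)
    return ranked[:k]


greedy = greedy_path(TREE)    # locally best children: "is" then "a"
best = top_k_paths(TREE, 1)   # globally most confident complete path
```

In this toy tree the greedy path "is, a" has confidence 0.6 × 0.6 = 0.36, while "has, to" reaches 0.4 × 0.95 = 0.38, so keeping only the single most confident path drops the greedy one. This is the kind of mismatch between the training-time greedy path and decoding-time re-ranking that the caption describes.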

Theorems & Definitions (4)

  • Theorem 1: Maximizing Draft Tree Reward Guarantees Improved Expected Acceptance Length
  • Lemma 1: Coordinate-wise monotonicity of acceptance probability
  • Proof: Proof sketch
  • Proof: Proof of Theorem 1