C2T: A Classifier-Based Tree Construction Method in Speculative Decoding

Feiye Huo, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Shengli Sun

TL;DR

C2T introduces a lightweight classifier-based approach to dynamic token-tree construction in speculative decoding, addressing the limitations of joint-probability confidence. By incorporating features from the token tree such as joint probability $P_i$, entropy $H_i$, and depth $d_i$ into a two-layer FFN, the method reduces the number of tokens the target model must verify by about 25% while preserving or improving acceptance length. Extensive experiments demonstrate strong feature complementarity, good cross-dataset and cross-model transferability within the same family, and notable speedups on large LLMs (e.g., ~18% wall-clock time reduction at the same acceptance length). C2T remains plug-and-play with existing SD frameworks, and shows practical benefits in chain-mode scenarios, though it does not yet support batch sizes greater than one in its current form.
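The classifier described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the ReLU hidden layer, the sigmoid output, the natural-log entropy, and the toy weights are all assumptions made for the example. Only the three features ($P_i$, $H_i$, $d_i$) and the two-layer FFN shape come from the text.

```python
import math

def entropy(probs):
    """Shannon entropy in nats (log base is an assumption) of a distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def token_features(parent_joint_prob, token_prob, dist, depth):
    """Feature vector [P_i, H_i, d_i] for one draft token.

    Assumption: H_i is the entropy of the draft distribution `dist`
    from which the token was drawn.
    """
    joint_prob = parent_joint_prob * token_prob
    return [joint_prob, entropy(dist), float(depth)]

def ffn_confidence(x, w1, b1, w2, b2):
    """Two-layer FFN: ReLU hidden layer, then a sigmoid confidence in (0, 1).

    The activation choices are assumptions; the source only specifies
    a two-layer FFN.
    """
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical toy weights (2 hidden units), for illustration only.
W1 = [[4.0, -0.5, -0.1], [1.0, 0.0, 0.0]]
B1 = [0.0, 0.0]
W2 = [1.5, 0.5]
B2 = -1.0

x = token_features(parent_joint_prob=0.6, token_prob=0.8,
                   dist=[0.8, 0.15, 0.05], depth=2)
confidence = ffn_confidence(x, W1, B1, W2, B2)
```

In a real setting the weights would be learned from tokens labeled as accepted or rejected by the target model; the three-input design keeps the classifier cheap relative to a draft-model forward pass.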

Abstract

The growing scale of Large Language Models (LLMs) has exacerbated inference latency and computational costs. Speculative decoding methods, which aim to mitigate these issues, often face inefficiencies in the construction of token trees and the verification of candidate tokens. Existing strategies, including chain mode, static tree, and dynamic tree approaches, have limitations in accurately preparing candidate token trees for verification. We propose a novel method named C2T that adopts a lightweight classifier to generate and prune token trees dynamically. Our classifier considers additional feature variables beyond the commonly used joint probability to predict the confidence score for each draft token to determine whether it is the candidate token for verification. This method outperforms state-of-the-art (SOTA) methods such as EAGLE-2 on multiple benchmarks, by reducing the total number of candidate tokens by 25% while maintaining or even improving the acceptance length.

Paper Structure

This paper contains 31 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Two illustrations of EAGLE-2 for verifying with 4 candidate tokens. Blue represents the chosen candidate tokens, red represents the tokens that were not chosen, bold text represents the correct answers, numbers on the arrows represent the generation probabilities, and C represents confidence, which in EAGLE-2 refers to joint probability.
  • Figure 2: The coordinates in the three heatmaps have the same meanings. The y-axis represents the entropy interval in which the current probability distribution lies; from top to bottom, these intervals are $0\sim1$, $1\sim2$, $2\sim3$, $3\sim4$, $4\sim5$, $5\sim6$, and $>6$. The x-axis shows the top 20 probabilities within the distribution, decreasing from left to right. Each square indicates the value for that probability rank within the respective entropy interval. In the first heatmap, the value is the average probability at each position, smoothed using a logarithm. In the second heatmap, the value is the accept rate at each position, also smoothed with a logarithm. In the third heatmap, the value is the bias between probability and accept rate, with red indicating probabilities higher than accept rates and blue indicating the opposite.
  • Figure 3: This figure provides further details on C2T. The classifier is a two-layer FFN, represented by the rounded rectangle labeled "Cls". For each token it takes the joint probability $P$, the entropy $H$, and the depth $d$ as features; these are depicted as the larger yellow circles, and bold text marks the correct answers. The classifier outputs a logit as the confidence score $C$, shown as the smaller circle, where blue represents the candidate tokens and red represents the tokens that were not recalled. A threshold $\beta = 0.5$ and Top$K$ with $K = 2$ are then used for pre-pruning. Tokens whose confidence exceeds $\beta$ are fed to the next round of the draft model $M_d$ to generate the next tree layer. The features for the next layer of tokens are obtained from the generation probability $p$ output by $M_d$.
  • Figure 4: C2T
  • Figure 5: Scatter plot for the LLaMA-2 7B model, with the acceptance length $\tau$ on the x-axis and the number of candidate tokens $\gamma$ on the y-axis. Cls denotes our classifier-based method C2T, E2 denotes EAGLE-2, w/o denotes no Top$K$ secondary pruning, and w$K$ denotes Top$K$ secondary pruning with $K$ values of 15, 20, 25, and 30.
  • ...and 7 more figures
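The layer-by-layer construction described in the Figure 3 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: `draft_step` is a hypothetical stand-in for the draft model $M_d$ that maps a token path to a next-token distribution, `classify` is any function from $(P, H, d)$ to a confidence score, and `max_depth` is an illustrative cap that does not appear in the source. The defaults $\beta = 0.5$ and $K = 2$ follow the caption.

```python
import math

def entropy(probs):
    """Shannon entropy in nats (log base is an assumption)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def expand_layer(frontier, draft_step, classify, beta=0.5, top_k=2):
    """Expand every frontier node by its Top-K children, keep those above beta.

    frontier: list of (path, joint_prob, depth) tuples.
    draft_step(path) -> dict of token -> probability (hypothetical M_d stub).
    classify(P, H, d) -> confidence score.
    """
    kept = []
    for path, joint_prob, depth in frontier:
        dist = draft_step(path)
        h = entropy(dist.values())  # entropy of the generating distribution
        top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for token, p in top:
            child_joint = joint_prob * p
            child_depth = depth + 1
            if classify(child_joint, h, child_depth) > beta:
                kept.append((path + (token,), child_joint, child_depth))
    return kept

def build_tree(draft_step, classify, max_depth=4, beta=0.5, top_k=2):
    """Collect candidate tokens layer by layer until no token survives."""
    frontier = [((), 1.0, 0)]  # empty path, joint probability 1, depth 0
    candidates = []
    for _ in range(max_depth):
        frontier = expand_layer(frontier, draft_step, classify, beta, top_k)
        if not frontier:
            break
        candidates.extend(frontier)
    return candidates

# Toy stand-ins: a fixed distribution and joint probability used as confidence.
def toy_draft_step(path):
    return {"a": 0.7, "b": 0.2, "c": 0.1}

def joint_prob_only(P, H, d):
    return P

candidates = build_tree(toy_draft_step, joint_prob_only)
```

Here `joint_prob_only` uses the joint probability alone as the confidence, the quantity the Figure 1 caption attributes to EAGLE-2; passing a learned classifier instead lets the entropy and depth features influence which tokens are kept.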

Theorems & Definitions (3)

  • proof : Proof-1
  • proof : Proof-2
  • proof : Proof-3