C2T: A Classifier-Based Tree Construction Method in Speculative Decoding
Feiye Huo, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Shengli Sun
TL;DR
C2T introduces a lightweight classifier-based approach to dynamic token-tree construction in speculative decoding, addressing the limitations of using joint probability alone as a confidence measure. By feeding token-tree features such as joint probability $P_i$, entropy $H_i$, and depth $d_i$ into a two-layer feed-forward network (FFN), the method reduces the number of tokens the target model must verify by about 25% while preserving or improving acceptance length. Experiments show that the features are strongly complementary, that the classifier transfers well across datasets and across models within the same family, and that it yields notable speedups on larger LLMs (e.g., ~18% wall-clock time reduction at the same acceptance length). C2T is plug-and-play with existing speculative decoding (SD) frameworks and shows practical benefits in chain-mode scenarios, though in its current form it does not support batch sizes greater than one.
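To make the feature set concrete, the following is a minimal sketch of how a two-layer FFN could map $(P_i, H_i, d_i)$ to a confidence score. It is not the authors' implementation. It assumes that $H_i$ is the Shannon entropy of the draft distribution the token was drawn from, that the hidden layer uses ReLU, and that the output is squashed with a sigmoid. The function names and all weight values are hypothetical and chosen only for illustration; a real classifier would be trained.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a draft-model distribution (assumed definition of H_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def joint_probability(path_probs: Sequence[float]) -> float:
    """Product of the draft probabilities along the root-to-node path (P_i)."""
    return math.prod(path_probs)


def ffn_confidence(
    features: Sequence[float],
    w1: List[List[float]],
    b1: List[float],
    w2: List[float],
    b2: float,
) -> float:
    """Two-layer FFN: linear -> ReLU -> linear -> sigmoid (activations are assumptions)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical, untrained weights: 3 input features, 2 hidden units.
W1 = [[2.0, -1.0, -0.5], [-1.0, 1.0, 0.5]]
B1 = [0.0, 0.0]
W2 = [1.5, -1.5]
B2 = 0.0

# A draft token at depth 2 whose path probabilities are 0.9 and 0.8,
# drawn from a three-way distribution [0.8, 0.1, 0.1].
p_i = joint_probability([0.9, 0.8])
h_i = entropy([0.8, 0.1, 0.1])
d_i = 2.0
confidence = ffn_confidence([p_i, h_i, d_i], W1, B1, W2, B2)
```

A token would then be kept as a verification candidate when `confidence` exceeds a chosen threshold. The threshold is a tunable parameter in this sketch, and its value is not taken from the source.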
Abstract
The growing scale of Large Language Models (LLMs) has increased inference latency and computational cost. Speculative decoding methods, which aim to mitigate these issues, often suffer from inefficiencies in constructing token trees and verifying candidate tokens. Existing strategies, including chain mode, static tree, and dynamic tree approaches, are limited in how accurately they prepare candidate token trees for verification. We propose a novel method named C2T that uses a lightweight classifier to generate and prune token trees dynamically. Beyond the commonly used joint probability, our classifier considers additional features to predict a confidence score for each draft token, which determines whether that token becomes a candidate for verification. This method outperforms state-of-the-art (SOTA) methods such as EAGLE-2 on multiple benchmarks, reducing the total number of candidate tokens by 25% while maintaining or even improving the acceptance length.
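The pruning idea can be illustrated with a small sketch, offered under stated assumptions rather than as the method's actual procedure. Here a draft token tree is pruned top-down: any node the classifier rejects is dropped together with its subtree, so fewer tokens reach the target model for verification. The `Node` class, the `prune` function, and the rule inside `keep` are hypothetical; `keep` is a hand-written stand-in for a trained classifier, and its cutoffs are arbitrary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    token: str
    joint_prob: float  # P_i: product of draft probabilities along the path
    entropy: float     # H_i: entropy of the distribution the token came from
    depth: int         # d_i: depth in the token tree
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, keep: Callable[[Node], bool]) -> int:
    """Drop rejected children (with their subtrees); return the count of surviving descendants."""
    node.children = [child for child in node.children if keep(child)]
    return sum(1 + prune(child, keep) for child in node.children)


def keep(node: Node) -> bool:
    """Stand-in for a trained classifier; the cutoffs are arbitrary illustrations."""
    return node.joint_prob > 0.2 and node.entropy < 1.5


# Toy draft tree: the root is the last accepted token and is never scored.
root = Node("<root>", 1.0, 0.0, 0, [
    Node("a", 0.60, 0.5, 1, [
        Node("c", 0.50, 0.4, 2),
        Node("d", 0.05, 2.0, 2),
    ]),
    Node("b", 0.10, 0.5, 1, [
        Node("e", 0.09, 0.1, 2),
    ]),
])

candidates = prune(root, keep)  # number of tokens left for the target model to verify
```

In this toy tree, node `e` is discarded even though it would pass no stricter a test than `b`, because its parent `b` was rejected first. A rejected token cannot be accepted during verification, so nothing below it is worth sending to the target model.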
