
Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding

Jeongtae Lee, Minjung Jo, Hyunjoon Jeong, Gunho Park, Sunghyeon Woo, Joonghoon Kim, Se Jung Kwon, Dongsoo Lee

TL;DR

This work introduces DropMatch, a novel approach that matches draft tokens to the predictive distribution of the target model via Monte Carlo dropout applied exclusively to the LM head, enabling sampling-based acceptance decisions.

Abstract

Speculative decoding accelerates large language model inference by proposing tokens with a lightweight draft model and selectively accepting them using a target model. This work introduces DropMatch, a novel approach that matches draft tokens to the predictive distribution of the target model via Monte Carlo dropout applied exclusively to the LM head, enabling sampling-based acceptance decisions. By generating multiple decoding paths, our method forms an empirical token distribution against which draft tokens are evaluated for consistency. This acceptance mechanism enables the model to adaptively control the size of decoding paths under an appropriate dropout probability, preventing substantial distortion of the target model predictive distribution. The proposed method operates in a training-free, data-free, and calibration-free manner, requires no architectural modification to pretrained models, and can be orthogonally integrated with a wide range of existing speculative decoding and inference acceleration techniques. Experiments across multiple benchmarks demonstrate that our approach increases acceptance length while maintaining competitive task performance, yielding inference speedups ranging from 1.09x to 1.33x over the standard baseline, and up to an additional 1.09x speedup when applied on top of EAGLE3.
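
The sampling idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the inverted-dropout rescaling, the greedy argmax per path, and the "accept if any path agrees" rule are all hypothetical choices made for the example. The paper's own criterion is JS-divergence based and is not reproduced here.

```python
import numpy as np

def mc_dropout_tokens(h, W, p_drop=0.1, K=5, rng=None):
    """Apply K independent dropout masks to the final hidden state h (shape [d])
    and project each masked copy through the LM head W (shape [V, d]).
    Returns the greedy token id of each of the K paths."""
    rng = np.random.default_rng() if rng is None else rng
    keep = 1.0 - p_drop
    masks = rng.random((K, h.shape[0])) < keep  # True = unit is kept
    h_paths = masks * h / keep                  # inverted-dropout rescaling (assumption)
    logits = h_paths @ W.T                      # shape [K, V]
    return logits.argmax(axis=1)

def empirical_distribution(tokens, vocab_size):
    """Empirical token distribution formed by the K sampled paths."""
    counts = np.bincount(tokens, minlength=vocab_size)
    return counts / counts.sum()

def accept_draft(draft_token, tokens):
    """Hypothetical rule: accept if any MC dropout path picked the draft token."""
    return bool(np.any(tokens == draft_token))
```

Only the LM head is re-evaluated per path in this sketch, so the cost of the K samples is K matrix-vector products on top of a single forward pass through the transformer body.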

Paper Structure (18 sections, 7 equations, 6 figures, 10 tables, 1 algorithm)

Figures (6)

  • Figure 1: Out-of-distribution performance of Auto-Judge on IFEval and EAGLE3 on KoMT-Bench. (a) Results of Auto-Judge using a judge head trained with the Llama-3.1-70B-Instruct model under out-of-distribution conditions, exhibiting preserved acceptance length but degraded task performance. (b) Results of EAGLE3 using a draft model trained with the Llama-3.3-70B-Instruct model, demonstrating that task performance is maintained while acceptance length decreases on KoMT-Bench, leading to reduced acceleration benefits. $\tau$ denotes the mean acceptance length.
  • Figure 2: Overall architecture of DropMatch, illustrating speculative decoding with multiple sampling enabled by MC dropout applied at the LM head. $d_t$ denotes the $t$-th draft token, and $h_t$ denotes its corresponding final embedding vector. $h_t^{(1)},\dots,h_t^{(K)}$ represent $K$ MC dropout paths generated by applying $K$ different dropout masks to the $t$-th embedding.
  • Figure 3: Semantic similarity across multiple decoding paths. (a) Cosine similarity matrices computed with Sentence-BERT and semantic consistency matrices from a sentence entailment model at dropout probability $p_{drop}=0.1$. (b) Corresponding results at $p_{drop}=0.3$, showing that lower dropout probabilities yield higher semantic similarity across paths. H1--H5 denote the $K=5$ MC dropout decoding paths, each corresponding to a distinct stochastic forward pass through the LM head. Darker colors indicate higher values.
  • Figure 4: Conceptual illustration of the JS-divergence–based acceptance criterion. (a) Acceptance determined solely by Eq. \ref{eq5} under dispersed MC dropout sample distributions. (b) Acceptance determined by Eq. \ref{eq6} under highly concentrated sample distributions. Both subfigures illustrate acceptance and rejection cases.
  • Figure 5: Comparison of Auto-Judge and Auto-Judge combined with DropMatch (DM) on 8-shot GSM8K with Llama-3.1-8B/70B-Instruct models. Accuracy versus mean acceptance length at dropout probabilities $p_{drop}=0.2$ and $0.3$ with $K=5$ MC dropout paths. A rightward shift of Auto-Judge + DM relative to Auto-Judge indicates increased acceptance length at comparable accuracy levels.
  • ...and 1 more figure
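
The Figure 4 caption refers to a JS-divergence–based acceptance criterion whose equations are not included in this listing. As background only, the sketch below computes the standard Jensen-Shannon divergence between two token distributions and compares it to a threshold. The helper `within_threshold`, its `threshold` parameter, and the choice to compare against the mean of the MC dropout path distributions are hypothetical and should not be read as the paper's equations.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def within_threshold(p_draft, path_probs, threshold):
    """Hypothetical check: is the draft distribution close (in JS divergence)
    to the mean of the K MC dropout path distributions (shape [K, V])?"""
    p_mean = path_probs.mean(axis=0)
    return bool(js_divergence(p_draft, p_mean) <= threshold)
```

Identical distributions give a divergence of 0 and distributions with disjoint support give ln 2, so any threshold in this sketch lives on a fixed, bounded scale regardless of vocabulary size.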